Lecture Notes in 
Computer Science 1808 



Santosh Pande Dharma P. Agrawal (Eds.) 



Compiler Optimizations 
for Scalable 
Parallel Systems 

Languages, Compilation Techniques, 
and Run Time Systems 




Springer 




Lecture Notes in Computer Science 1808 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 




Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 








Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 

Volume Editors 
Santosh Pande 

Georgia Institute of Technology, College of Computing 
801 Atlantic Drive, Atlanta, GA 30332, USA 
E-mail: santosh@cc.gatech.edu 
Dharma P. Agrawal 

University of Cincinnati, Department of ECECS 
P.O. Box 210030, Cincinnati, OH 45221-0030, USA 
E-mail: dpa@ececs.uc.edu 

Cataloging-in-Publication Data applied for 

Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Compiler optimizations for scalable parallel systems : languages, 
compilation techniques, and run time systems / Santosh Pande ; Dharma 
P. Agrawal (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong 
Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2001 
(Lecture notes in computer science ; 1808) 

ISBN 3-540-41945-4 



CR Subject Classification (1998): D.3, D.4, D.1.3, C.2, E.1.2, F.3 
ISSN 0302-9743 

ISBN 3-540-41945-4 Springer-Verlag Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are 
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York 
a member of BertelsmannSpringer Science+Business Media GmbH 

http://www.springer.de 

© Springer-Verlag Berlin Heidelberg 2001 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign 
Printed on acid-free paper SPIN: 10720238 06/3142 5 4 3 2 1 0 




Preface 



Santosh Pande¹ and Dharma P. Agrawal² 

¹ College of Computing, 
801 Atlantic Drive, 
Georgia Institute of Technology, 
Atlanta, GA 30332 
² Department of ECECS, ML 0030, 
PO Box 210030, 
University of Cincinnati, 
Cincinnati, OH 45221-0030 



We are very pleased to publish this monograph on Compiler Optimizations 
for Scalable Distributed Memory Systems. Distributed memory systems offer 
a challenging model of computing and pose fascinating problems regarding 
compiler optimizations, ranging from language design to run time systems. 
The research done in this area is thus foundational to many challenges, 
from memory hierarchy optimizations to communication optimizations, 
encountered in both stand-alone and distributed systems. It is with this 
motivation that we present a compendium of research done in this area in 
the form of this monograph. 

This monograph is divided into five sections: section one deals with 
languages, section two with analysis, section three with communication 
optimizations, section four with code generation, and section five with run 
time systems. In the editorial we present a detailed summary of each of the 
chapters in these sections. 

We would like to express our sincere thanks to the many people who 
contributed to this monograph. First, we would like to thank all the authors 
for their excellent contributions, which really make this monograph one of a 
kind; as readers will see, these contributions make the monograph thorough 
and insightful (for an advanced reader) as well as highly readable and 
pedagogic (for students and beginners). Next, we would like to thank our 
graduate student Haixiang He for all his help in organizing this monograph 
and for solving LaTeX problems. Finally, we express our sincere thanks to the 
LNCS Editorial at Springer-Verlag for putting up with our schedule and for 
all their help and understanding. Without their invaluable help we would not 
have been able to put this monograph into its beautiful final shape! We 
sincerely hope that readers find the monograph truly useful in their work, be 
it further research or practice. 

1. Compiling for Distributed Memory Multiprocessors 

1.1 Motivation 

Distributed memory parallel systems offer elegant architectural solutions 
for highly parallel, data intensive applications, primarily because: 

They are highly scalable. These systems currently come in a variety of 
architectures, such as 3D torus, mesh and hypercube, that allow the addition 
of extra processors should the computing demands increase. Scalability is 
an important issue, especially for high performance servers such as parallel 
video servers, data mining and imaging applications. 

With an increase in parallelism, there is insignificant degradation in 
memory performance, since memories are isolated and decoupled from direct 
accesses from processors. This is especially good for data intensive 
applications such as parallel databases and data mining that demand 
considerable memory bandwidths. In contrast, the memory bandwidths may not 
match the increase in the number of processors in shared memory systems. In 
fact, the overall system performance may degrade due to increased memory 
contention. This in turn jeopardizes scalability of the application beyond a 
point. 

Spatial parallelism in large applications such as Fluid Flow, Weather 
Modeling and Image Processing, in which the problem domains are perfectly 
decomposable, is easy to map onto these systems. The achievable speedups 
are almost linear, primarily due to fast accesses to the data maintained in 
local memory. 

The interprocessor communication speeds and bandwidths have dramatically 
improved due to very fast routing. The performance ratings offered 
by newer distributed memory systems have improved, although they are 
not comparable to shared memory systems in terms of Mflops. 

Medium grained parallelism can be effectively mapped onto the newer 
systems like the Meiko CS-2, Cray T3D, IBM SP1/SP2 and EM4 due to a 
low ratio of communication/computation speeds. The communication 
bottleneck has decreased compared with earlier systems, and this has opened 
up parallelization of newer applications. 

1.2 Complexity 

However, programming distributed memory systems to achieve the desired 
high performance remains very complex. Most of the current solutions mandate 
that the users of such machines manage the processor allocation, data 
distribution and inter-processor communication in their parallel programs. 
In spite of frantic demands by programmers, current solutions provided by 
(semi-automatic) parallelizing compilers are rather constrained. As a matter 
of fact, for many applications the only practical success has been through 
hand parallelization of codes with communication managed through MPI. In 
spite of a tremendous amount of research in this area, the applicability of 
many of the compiler techniques remains rather limited and the achievable 
performance enhancement remains less than satisfactory. The main reason for 
the restrictive solutions offered by parallelizing compilers is the enormous 
complexity of the problem. Orchestrating computation and communication by 
suitable analysis, and optimizing their performance through judicious use of 
underlying architectural features, demands true sophistication on the part 
of the compiler. It is not even clear whether these complex problems are 
solvable within the realm of compiler analysis and sophisticated 
restructuring transformations. Perhaps they are much deeper in nature and go 
right into the heart of the design of parallel algorithms for such an 
underlying model of computation. 

The primary purpose of this monograph is to provide an insight into 
current approaches and point to potentially open problems that could have an 
impact. The monograph is organized in terms of issues ranging from 
programming paradigms (languages) to effective run time systems. 

1.3 Outline of the Monograph 

Language design is largely a matter of legacy, and language design for 
distributed memory systems is no exception to the rule. In section I of the 
monograph we examine three important approaches (one imperative, one 
object-oriented and one functional) in this domain that have made a 
significant impact. The first chapter, on HPF 2.0, provides an in-depth view 
of a data parallel language which evolved from Fortran 90. The authors 
present HPF 1.0 features such as BLOCK distribution and the FORALL loop, as 
well as new features in HPF 2.0 such as INDIRECT distribution and the ON 
directive. They also point to the complementary nature of MPI and HPF and 
discuss features such as the EXTRINSIC interface mechanism. HPF 2.0 has been 
a major commercial success, with many vendors such as Portland Group and 
Applied Parallel Research providing highly optimizing compiler support which 
generates message passing code. Many research issues, especially those 
related to supporting irregular computation, could prove valuable to domains 
such as sparse matrix computation. The next chapter, on Sisal 90, provides a 
functional view of implicit parallelism specification and mapping. A shared 
memory implementation of Sisal is discussed, which involves optimizations 
such as update in place and copy elimination. Sisal 90 and a distributed 
memory implementation which uses message passing are also discussed. 
Finally, multi-threaded implementations of Sisal are discussed, with a focus 
on multi-threaded optimizations. The newer optimizations which perform 
memory management in hardware through dynamically scheduled multi-threaded 
code should prove beneficial for the performance of functional languages 
(including Sisal), which have an elegant programming model. The next 
chapter, on HPC++, provides an object oriented view as well as details on a 
library and compiler strategy to support the HPC++ level 1 release. The 
authors discuss interesting features related to multi-threading, barrier 
synchronization and remote procedure invocation. They also discuss library 
features that are especially useful for scientific programming. Extensions 
of this work to newer portable languages such as Java are currently an 
active area of research. We also have a chapter on concurrency models of OO 
paradigms. The authors specifically address a problem called the inheritance 
anomaly, which arises when synchronization constraints are implemented 
within methods of a class and an attempt is made to specialize methods 
through inheritance mechanisms. They propose a solution to this problem by 
separating the specification of synchronization from the method 
specification. The synchronization construct is not a part of the method 
body and is handled separately. It will be interesting to study the compiler 
optimizations on this model related to strength reduction of barriers, and 
issues such as data partitioning vs. barrier synchronizations. 

In section II of the monograph, we focus on various analysis techniques. 
Parallelism detection is very important, and the first chapter presents a 
very interesting comparative study of different loop parallelization 
algorithms by Allen and Kennedy, Wolf and Lam, Darte and Vivien, and 
Feautrier. The authors provide comparisons in terms of performance (ability 
to parallelize as well as quality of schedules generated for code 
generation) as well as complexity. The comparison also focusses on the type 
of dependence information available. Further extensions could involve 
run-time parallelization given more precise dependence information. Array 
data-flow is of utmost importance in optimizations, both sequential and 
parallel. The first chapter on array data-flow analysis examines this 
problem in detail and presents techniques for exact data flow as well as for 
approximate data flow. The exact solution is shown for static control 
programs. The authors also show applications to interprocedural cases and 
some important parallelization techniques such as privatization. Some 
interesting extensions could involve run-time data flow analysis. The next 
chapter discusses interprocedural analysis based on guarded (predicated) 
array regions. This is a framework based on path-sensitive predicated 
data-flow which provides summary information. The authors also show 
application of their work to improve array privatization based on symbolic 
propagation. Extensions of these to newer object oriented languages such as 
Java (which have a clean class hierarchy and inheritance model) could be 
interesting, since these programs really need such summary MOD information 
for performing any optimization. We finally present a very important 
analysis/optimization technique for array privatization. Array privatization 
involves removing memory-related dependences, which have a significant 
impact on communication optimizations, loop scheduling etc. The authors 
present a demand-driven data-flow formulation of the problem; an algorithm 
which performs single pass propagation of symbolic array expressions is also 
presented. This comprehensive framework, implemented in the Polaris 
compiler, is making a significant impact in improving many other related 
optimizations such as load balancing, communication etc. 

The next section is focussed on communication optimization. 
Communication optimization can be achieved through data (and iteration 
space) distribution, statically or dynamically. These approaches further 
classify into data and code alignment or simply iteration space 
transformations such as tiling. Communication can also be optimized in 
data-parallel programs through array region analysis. Finally, one could 
tolerate some communication latency through novel techniques such as 
multi-threading. We have chapters which cover this broad range of topics 
about communication in depth. 

The first chapter in this section focusses on tiling for cache-coherent 
multicomputers. This work derives optimal tile parameters for minimal 
communication in loops with affine index expressions. The authors introduce 
a notion of data footprints and tile the iteration spaces so that the volume 
of communication is minimized. They develop an important lattice theoretic 
framework to precisely determine the sizes of data footprints, which are 
very valuable not only in tiling but in many array distribution 
transformations. The next two chapters deal with the important problem of 
communication free loop partitioning. 

The second chapter in this section focusses on comparing different 
methods of achieving communication-free partitioning for DOALL loops. This 
chapter discusses several variants of the communication-free partitioning 
problem involving duplication or non-duplication of data, load balancing of 
iteration space and aspects such as statement level vs. loop level 
partitioning. Several aspects such as trading parallelism to avoid 
inter-loop data distribution are also touched upon. Extending these 
techniques to broader classes of DOALL loops could enhance their 
applicability. 

The next chapter by Pingali et al. proposes a very interesting framework 
which first determines a set of constraints on data and loop iteration 
placement. The authors then determine which constraints should be left 
unsatisfied, relaxing an overconstrained system to find a solution involving 
a large amount of parallelism. Finally, the remaining constraints are solved 
for data and code distribution. The systematic linear algebraic framework 
improves over many ad-hoc loop partitioning approaches. 

These approaches trade parallelism for codes that allow decoupling the 
issues of parallelism and communication by relaxing an appropriate 
constraint of the problem. However, for many important problems such as 
image processing applications, such a relaxation is not possible. That is, 
one must resort to a different partitioning solution based on the relative 
costs of communication and computation. In the next chapter, a new approach 
for solving such a problem is proposed: the iteration space is partitioned 
by determining directions which maximally cover the communication while 
minimally trading parallelism. This approach allows mapping of general 
medium grained DOALL loops. However, the communication resulting from this 
iteration space partitioning cannot be easily aggregated without 
sophisticated 'pack'/'unpack' mechanisms present at the send/receive ends. 
Such extensions are desirable, since aggregating communication has as 
significant an impact as reducing the volume. 

Static data distribution and alignment typically solve the problems of 
communication on a loop nest by loop nest basis but rarely in an 
intraprocedural scope. Most of the inter-loop nest level and interprocedural 
boundaries require dynamic data redistribution. Banerjee et al. develop 
techniques that can be used to automatically determine which data partitions 
are most beneficial over specific sections of the program by accounting for 
redistribution overhead. They determine split points and phases of 
communication; redistribution is performed at the split points. 

When communication must take place, it should be optimized. Also, any 
redundancies must be captured and eliminated. Manish Gupta in the next 
chapter proposes a comprehensive approach for performing global 
(interprocedural) communication optimizations such as vectorization, PRE, 
coalescing, hoisting etc. Such an interprocedural approach to communication 
optimization can substantially improve performance. Extending this work to 
irregular communication could be interesting. 

Finally, we present a multi-threaded approach which could hide the 
communication latency. Two representative applications involving bitonic 
sort and FFT are chosen, and using fine grained multi-threading on EM-X it 
is shown that multi-threading can substantially help in overlapping 
computation with communication to hide latencies up to 35%. These methods 
could be especially useful for irregular computation. 

The final phase of compiling for distributed memory systems involves 
solving many code generation problems. Code generation problems involve 
determining communication generation and doing address calculation to map 
global references to local ones. The next section deals with these issues. 
The first chapter presents structures and techniques for communication 
generation. The authors focus on issues such as flexible computation 
partitioning (going beyond the owner computes rule), communication 
adaptation based upon manipulating integer sets through abstract 
inequalities, and control flow simplification based on these. One good 
property of this work is that it can work with many different front ends 
(not just data parallel languages), and the code generator has more 
opportunities to perform low level optimizations due to simplified control 
flow. 

The second chapter discusses basis vector based address calculation 
mechanisms for efficient traversals of partitioned data. While one important 
issue of code generation is communication generation, a very important issue 
is to map the global address space to the local address space efficiently. 
The problem is complicated due to data distributions and access strides. 
Ramanujam et al. present closed form expressions for basis vectors for 
several cases. Using the closed form expressions for the basis vectors, they 
derive a non-unimodular linear transformation. 
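
To make the address calculation problem concrete, the following is a minimal Python sketch of global-to-local mapping under a block-cyclic distribution (blocks of k elements dealt round-robin to the processors). It is an illustration only: the function names, the 0-based indexing and the brute-force strided scan are assumptions made here, and the scan is the naive baseline that closed form basis vector techniques are meant to avoid, not the method of the chapter.

```python
def global_to_local(g, k, nprocs):
    """Map 0-based global index g to (owner, local index) for blocks of size k."""
    block, offset = divmod(g, k)
    owner = block % nprocs                   # blocks are dealt round-robin
    local = (block // nprocs) * k + offset   # position inside the owner's memory
    return owner, local


def local_to_global(owner, local, k, nprocs):
    """Inverse mapping: recover the global index from (owner, local index)."""
    local_block, offset = divmod(local, k)
    return (local_block * nprocs + owner) * k + offset


def local_addresses(owner, lo, hi, stride, k, nprocs):
    """Naive scan: local addresses on `owner` touched by the access lo:hi:stride."""
    touched = []
    for g in range(lo, hi, stride):
        p, local = global_to_local(g, k, nprocs)
        if p == owner:
            touched.append(local)
    return touched


if __name__ == "__main__":
    # With k = 2 and 3 processors, processor 0 owns globals 0, 1, 6, 7, 12, 13.
    print(global_to_local(13, 2, 3))
    print(local_addresses(0, 0, 14, 3, 2, 3))
```

The scan visits every element of the strided section, including those owned by other processors; the interest of a closed form solution is to step directly from one locally owned element to the next.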

The final section is on supporting task parallelism and dynamic data 
structures. We also present a run-time system to manage irregular 
computation. The first chapter by Darbha et al. presents a task scheduling 
approach that is optimal for many practical cases. The authors evaluate its 
performance for many practical applications such as the Bellman-Ford 
algorithm, Cholesky decomposition, the Systolic algorithm etc. They show 
that schedules generated by their algorithm are optimal for some cases and 
near optimal for most others. With HPF 2.0 supporting task parallelism, this 
could open up many new application domains. 

The next two chapters describe language support for dynamic data 
structures such as pointers in a distributed address space. Gupta describes 
several extensions to C with declarations such as TREE, ARRAY and MESH to 
declare dynamic data structures. He then describes name generation and 
distribution strategies. Finally, he describes support for both regular and 
irregular dynamic structures. The second chapter by Rogers et al. presents 
an approach followed in their Olden project, which uses a distributed heap. 
Remote access is handled by software caching or computation migration. The 
selection of these mechanisms is done automatically through a compile time 
heuristic. They provide a data layout annotation to the programmer called 
local path lengths, which allows programmers to give hints regarding 
expected data layout, thereby fixing these mechanisms. Both of these 
chapters provide highly useful insights into supporting dynamic data 
structures, which are very important for scalable domains of computation 
supported by these machines. Thus, these works should have a significant 
impact on future scalable applications supported by these systems. 

Finally, we present a run-time system called CHAOS which provides 
efficient support for irregular computations. Due to indirection in many 
sparse matrix computations, the communication patterns are unknown at 
compile time in these applications. Indirection patterns have to be 
preprocessed, and the sets of elements to be sent and received by each 
processor precomputed, in order to optimize communication. In this work, the 
authors provide details of efficient run time support for an 
inspector-executor model. 
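
The inspector-executor idea can be illustrated with a small single-process Python simulation of the gather x[idx[i]] over a block-distributed array x. This is a sketch under stated assumptions, not the CHAOS interface: the names inspector and executor, the fixed block size and the list-of-lists stand-in for per-processor memories are all hypothetical choices made for the example.

```python
def inspector(idx_local, me, block):
    """Preprocess the indirection array once.

    Returns a schedule {owner: sorted unique global indices} listing the
    off-processor elements that processor `me` must receive.
    """
    needed = {}
    for g in idx_local:
        p = g // block                      # owner under a block distribution
        if p != me:
            needed.setdefault(p, set()).add(g)
    return {p: sorted(gs) for p, gs in needed.items()}


def executor(x_parts, idx_local, me, block, schedule):
    """Run the gather using the precomputed schedule.

    x_parts[p] simulates the local memory of processor p. Each entry of the
    schedule corresponds to one aggregated message from one owner.
    """
    ghost = {}
    for p, gs in schedule.items():
        for g in gs:
            ghost[g] = x_parts[p][g - p * block]
    result = []
    for g in idx_local:
        if g // block == me:
            result.append(x_parts[me][g - me * block])   # local access
        else:
            result.append(ghost[g])                      # received copy
    return result


if __name__ == "__main__":
    x_parts = [[0, 10, 20, 30], [40, 50, 60, 70]]   # 8 elements, 2 processors
    idx = [1, 5, 7, 5, 2]                            # indirection on processor 0
    schedule = inspector(idx, me=0, block=4)
    print(schedule)
    print(executor(x_parts, idx, 0, 4, schedule))
```

The schedule is computed once and can be reused for every execution of the loop as long as the indirection array does not change, which is where the approach recovers its preprocessing cost.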

1.4 Future Directions 

The two important bottlenecks for the use of distributed memory systems 
are the limited application domains and the fact that the performance is 
less than satisfactory. The main bottleneck seems to be handling 
communication; thus, efficient solutions must be developed. Application 
domains beyond regular communication can be handled by supporting a general 
run-time communication model. This run-time communication model must be 
latency hiding and should give sufficient flexibility to the compiler to 
defer the hard decisions to run time, yet allow static optimizations such as 
communication motion. One of the big problems compilers face is that 
estimating the cost of communication is almost impossible. They can, 
however, gauge the criticality (or relative importance) of communication. 
Developing such a model will allow compilers to deal more effectively with 
the relative importance of computation versus communication, and of one 
communication versus another. 

Probably the best reason to use distributed memory systems is to benefit 
from scalability, even though application domains and performance might be 
somewhat weaker. Thus, new research must be done in scalable code 
generation. In other words, as the size of the problem and the number of 
processors increase, should there be a change in the data/code partition or 
should it remain the same? What code generation issues are related to this? 
How could one potentially handle the hot spots that inevitably (although at 
much lower levels than in shared memory systems) arise? Can one benefit from 
the above communication model and the dynamic data ownerships discussed 
earlier? 

Summary. High Performance Fortran (HPF) was defined in 1993 as a portable 
data-parallel extension to Fortran. This year it was updated by the release of HPF 
version 2.0, which clarified many existing features and added a number of extensions 
requested by users. Compilers for these extensions are expected to appear beginning 
in late 1997. In this paper, we present an overview of the entire language, including 
HPF 1 features such as BLOCK distribution and the FORALL statement and HPF 2 
additions such as INDIRECT distribution and the ON directive. 



1. Introduction 

High Performance Fortran (HPF) is a language that extends standard 
Fortran by adding support for data-parallel programming on scalable parallel 
processors. The original language document, the product of an 18-month 
informal standardization effort by the High Performance Fortran Forum, was 
released in 1993. HPF 1.0 was based on Fortran 90 and was strongly 
influenced by the SIMD programming model that was popular in the early 90s. 
The language featured a single thread of control and a shared-memory 
programming model in which any required interprocessor communication would 
be generated implicitly by the compiler. 

In spite of widespread interest in the language, HPF was not an 
immediate success, suffering from the long lead time between its definition 
and the appearance of mature compilers and from the absence of features that 
many application developers considered essential. In response to the latter 
problem, the HPF Forum reconvened in 1995 and 1996 to produce a revised 
standard called HPF 2.0 [11]. The purpose of this paper is two-fold: 

To give an overview of the HPF 2.0 specification, and 

To explain (in general terms) how the language may be implemented. 

We start by giving a short history of HPF and a discussion of the components 
of the language. 

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 3-43, 2001. 
Springer-Verlag Berlin Heidelberg 2001 

2. History and Overview of HPF 

HPF has attracted great interest since the inception of the first 
standardization effort in 1991. Many users had long hoped for a portable, 
efficient, high-level language for parallel programming. In the 1980s, 
Geoffrey Fox's analysis of parallel programs [5,6] and other projects had 
identified and popularized data-parallel programming as one promising 
approach to this goal. The data-parallel model derived its parallelism from 
the observation that updates to individual elements of large data structures 
were often independent of each other. For example, successive 
over-relaxation techniques update every point of a mesh based on the 
(previous) values there and at adjacent points. This observation identified 
far more parallelism in the problem than could be exploited by the physical 
processors available. Data-parallel implementations resolved this mismatch 
by dividing the data structure elements between the physical processors and 
scheduling each processor to perform the computations needed by its local 
data. Sometimes the local computations on one processor required data from 
another processor. In these cases, the implementation inserted 
synchronization and/or communication to ensure that the correct version of 
the data was used. How the data had been divided determined how often the 
processors had to interact. Therefore, the key intellectual step in writing 
a data-parallel program was to determine how the data could be divided to 
minimize this interaction; once this was done, inserting the synchronization 
and communication was relatively mechanical. 
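
The division of a mesh among processors can be sketched with a small single-process Python simulation. It uses a one-dimensional mesh and a Jacobi-style averaging step as a simplified stand-in for the relaxation example above; the function names, the block division and the assumption that every processor receives at least one mesh point are illustrative choices, not taken from any particular implementation.

```python
def relax_sequential(u):
    """One relaxation sweep: each interior point becomes the mean of its neighbours."""
    interior = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def split_blocks(u, nprocs):
    """Divide the mesh into contiguous blocks, one per processor."""
    size = -(-len(u) // nprocs)   # ceiling division
    return [u[p * size:(p + 1) * size] for p in range(nprocs)]


def relax_distributed(parts):
    """Same sweep on the divided mesh (assumes every block is non-empty)."""
    nprocs = len(parts)
    # Communication step: each processor obtains one edge value from each
    # neighbour. None marks the fixed boundary of the global mesh.
    left = [parts[p - 1][-1] if p > 0 else None for p in range(nprocs)]
    right = [parts[p + 1][0] if p < nprocs - 1 else None for p in range(nprocs)]
    new_parts = []
    for p, local in enumerate(parts):
        padded = [left[p]] + local + [right[p]]
        new_local = []
        for j in range(1, len(padded) - 1):
            if padded[j - 1] is None or padded[j + 1] is None:
                new_local.append(padded[j])          # global boundary point
            else:
                new_local.append((padded[j - 1] + padded[j + 1]) / 2)
        new_parts.append(new_local)
    return new_parts


if __name__ == "__main__":
    u = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
    flat = [v for part in relax_distributed(split_blocks(u, 4)) for v in part]
    print(flat == relax_sequential(u))
```

With contiguous blocks, each processor exchanges only two values per sweep regardless of block size, which is the sense in which the choice of data division determines how often the processors interact.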

In the late 1980s, several research projects [2,7,8,15-18,20] and 
commercial compilers [12,19] designed languages to implement data-parallel 
programming. These projects extended sequential or functional languages to 
include aggregate operations, most notably array syntax and forall 
constructs, that directly reflected data-parallel operations. Also, they 
added syntax for describing data mappings, usually by specifying a 
high-level pattern for how the data would be divided among processors. 
Programmers were responsible for using these "data distribution" and "data 
parallel" constructs appropriately. In particular, the fastest execution was 
expected when the dimension(s) that exhibited data parallelism were also 
distributed across parallel processors. Furthermore, the best distribution 
pattern was the one that produced the least communication; that is, the 
pattern that required the least combining of elements stored on separate 
processors. What the programmer did not have to do was equally important. 
Data-parallel languages did not require the explicit insertion of 
synchronization and communication operations. This made basic programming 
much easier, since the user needed only to consider the (sequential) 
ordering of large-grain operations rather than the more complex and numerous 
interconnections between individual processors. In other words, 
data-parallel languages had sequential semantics; race conditions were not 
possible. The cost of this convenience was increasingly complex compilers. 

The job of the compiler and run-time system for a data-parallel language 
was to efficiently map programs onto parallel hardware. Typically, the 
implementation used a form of the owner-computes rule, which assigned the 
computation in an assignment statement to the processor that owned the 
left-hand side. Loops over distributed data structures, including the loops 
implied by aggregate operations, were an important special case of this rule; 
they were strip-mined so that each processor ran over the subset of the loop 
iterations specified by the owner-computes rule. This strip-mining 
automatically divided the work between the processors. Different projects 
developed various strategies for inserting communication and 
synchronization, ranging from pattern-matching [16] to dependence-based 
techniques [17]. Because the target platforms for the compilers were often 
distributed-memory computers like the iPSC/860, communication costs were 
very high. Therefore, the compilers expended great effort to reduce this 
cost through bundling communication [15] and using efficient collective 
communication primitives [16]. Similar techniques proved useful on a variety 
of platforms [8], giving further evidence that data-parallel languages might 
be widely portable. At the same time, the commercial Connection Machine 
Fortran compiler [19] was proving that data parallelism was feasible for 
expressing a variety of codes. 
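
The owner-computes rule and the strip-mining it implies can be sketched in a few lines of Python. The example assumes a block distribution with block size ceil(n/P) and 0-based, half-open index ranges; the function names are hypothetical and the code only simulates, in one process, the loop bounds each processor would be given.

```python
def block_range(p, n, nprocs):
    """Half-open range [lo, hi) of global indices owned by processor p."""
    size = -(-n // nprocs)   # ceiling division gives the block size
    return min(p * size, n), min((p + 1) * size, n)


def local_iterations(p, lo, hi, n, nprocs):
    """Iterations of `for i in range(lo, hi): a[i] = ...` that processor p runs.

    Under owner-computes, p executes exactly the iterations whose left-hand
    side element a[i] it owns, i.e. the loop range clipped to its block.
    """
    own_lo, own_hi = block_range(p, n, nprocs)
    return range(max(lo, own_lo), min(hi, own_hi))


if __name__ == "__main__":
    n, nprocs = 10, 4
    b = list(range(n))
    a = [None] * n
    for p in range(nprocs):                  # each pass plays one processor
        for i in local_iterations(p, 2, 9, n, nprocs):
            a[i] = b[i] + 1                  # a[i] is owned by processor p
    print([list(local_iterations(p, 2, 9, n, nprocs)) for p in range(nprocs)])
    print(a)
```

Because the clipped ranges are disjoint and together cover the original loop, the work is divided without any per-iteration ownership test; a processor whose block does not intersect the loop simply receives an empty range.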

Many of the best ideas for data-parallel languages were eventually 
incorporated into Fortran dialects by the Fortran D project at Rice and 
Syracuse Universities [4], the Vienna Fortran project at the University of 
Vienna [3] and work at COMPASS, Inc. [12]. Early successes there led to a 
Supercomputing '91 birds-of-a-feather session that essentially proposed 
development of a standard data-parallel dialect of Fortran. At a follow-up 
meeting in Houston the following January, the Center for Research on 
Parallel Computation (CRPC) agreed to sponsor an informal standards process, 
and the High Performance Fortran Forum (HPFF) was formed. A "core" group of 
HPFF met every 6 weeks in Dallas for the next year, producing a preliminary 
draft of the HPF language specification presented at Supercomputing '92¹ and 
the final HPF version 1.0 language specification early the next year [9]. 
The outlines of HPF 1.0 were very similar to its immediate predecessors: 

Fortran 90 [1] (the base language) provided immediate access to array 
arithmetic, array assignment, and many useful intrinsic functions. 

The ALIGN and DISTRIBUTE directives (structured comments recognized 
by the compiler) described the mapping of partitioned data structures. 
Section 3 describes these features in more detail. 

The FORALL statement (a new construct), the INDEPENDENT directive (an 
assertion to the compiler), and the HPF library (a set of data-parallel 
subprograms) provided a rich set of data-parallel operations. Section 4 
describes these features in more detail. 

EXTRINSIC procedures (an interface to other programming paradigms)
provided an "escape hatch" for programmers who needed access to low-level
machine details or forms of parallelism not well-expressed by data-parallel
constructs. Section 5 describes these functions in more detail.

A reference to the standard [14] was published soon afterward, and HPFF 
went into recess for a time. 

^ That presentation was so overcrowded that movable walls were removed during 
the session to make a larger room. 







Ken Kennedy and Charles Koelbel 



In 1994, HPFF resumed meetings with two purposes:

To consider Corrections, Clarifications, and Interpretations (CCI) of the
HPF 1.0 language in response to public comments and questions, and

To determine requirements for further extensions to HPF by consideration
of advanced applications codes.

The discussions led to the publication of a new language specification
(HPF version 1.1). Although some of the clarifications were important for
special cases, there were no major language modifications. The extensions
requirements were collected in a separate document [10]. They later served
as the basis for discussions toward HPF 2.0.

In January 1995, HPFF undertook its final (to date) series of meetings,
with the intention of producing a significant update to HPF. Those meetings
were completed in December 1996, and the HPF version 2.0 language
specification [11] appeared in early 1997. The basic syntax and semantics
of HPF did not change in version 2.0: generally, programs still consisted
of sequential compositions of aggregate operations on distributed arrays.
However, there were some significant revisions:

HPF 2.0 consists of two parts: a base language, and a set of approved
extensions. The base language is very close to HPF 1.1, and is expected to
be fully implemented by vendors in a relatively short time. The approved
extensions are more advanced features which are not officially part of the
language, but which may be adopted in future versions of HPF. However,
several vendors have committed to supporting one or more of the extensions
due to customer demands. In this paper, we will refer to both parts of the
language as "HPF 2.0" but will point out approved extensions when they
are introduced.

HPF 2.0 removes, restricts, or reclassifies some features of HPF 1.1,
particularly in the area of dynamic remapping of data. In all cases, the
justification of these changes was that the cost of implementation was much
higher than originally thought, and did not justify the advantage gained by
including the features.

HPF 2.0 adds a number of features, particularly in the areas of new
distribution patterns, REDUCTION clauses in INDEPENDENT loops, the new ON
directive for computation scheduling (including task parallelism), and
asynchronous I/O.

The remainder of this paper considers the features of HPF 2.0 in more
detail. In particular, each section below describes a cluster of related
features, including examples of their syntax and use. We close the paper
with a look to future prospects for HPF.

3. Data Mapping

The most widely discussed features of HPF describe the layout of data onto
memories of parallel machines. Conceptually, the programmer gives a
high-level description of how large data structures (typically, arrays)
will be partitioned between the processors. It is the compiler and run-time
system's responsibility to carry out this partitioning. This data mapping
does not directly create a parallel program, but it does set the stage for
parallel execution. In particular, data parallel statements operating on
partitioned data structures can execute in parallel. We will describe that
process more in Section 4.

3.1 Basic Language Features 

HPF uses a 2-phase data mapping model. Individual arrays can be aligned
together, thus ensuring that elements aligned together are always stored on
the same processor. This minimizes data movement if the corresponding
elements are accessed together frequently. Once arrays are aligned in this
way, one of them can be distributed, thus partitioning its elements across
the processors' memories. This affects the data movement from combining
different elements of the same array (or, by implication, elements of
different arrays that are not aligned together). Distribution tends to be
more machine-dependent than alignment; differing cost tradeoffs between
machines (for example, relatively higher bandwidth or longer latency) may
dictate different distribution patterns when porting. Section 3.1.1 below
describes alignment, while Section 3.1.2 describes distribution.

It bears mentioning that the data mapping features of HPF are technically
directives, that is, structured comments that are recognized by the
compiler. The advantage of this approach is determinism: in general, an HPF
program will produce the same result when run on any number of
processors.^ Another way of saying this is that HPF data mapping affects
only the efficiency of the program, not its correctness. This has obvious
attraction for maintaining and porting codes. We feel it is a key to HPF's
success to date.

3.1.1 The ALIGN Directive. The ALIGN directive creates an alignment 
between two arrays. Syntactically, the form of this directive is 

!HPF$ ALIGN alignee [ ( align-dummy-list ) ] WITH [ * ] target [ ( align-subscript-list ) ]



^ There are two notable exceptions to this. HPF has intrinsic functions (not
described in this paper) that can query the mapping of an array; a programmer
could use these to explicitly code different behaviors for different numbers of
processors. Also, round-off errors can occur which may be sensitive to data
mapping; this is most likely to occur if the reduction intrinsics described in
Section 4 are applied to mapped arrays.

(For brevity, we ignore several alternate forms that can be reduced to this
one.) The alignee can be any object name, but is typically an array name.
This is the entity being aligned. The target may be an object or a
template; in either case, this is the other end of the alignment
relationship. (Templates are "phantom" objects that have shape, but take no
storage; they are sometimes useful to provide the proper sized target for
an alignment.) Many alignees can be aligned to a single target at once. An
align-dummy is a scalar integer variable that may be used in (at most) one
of the align-subscript expressions. An align-subscript is an affine linear
function of one align-dummy, or it is a constant, or it is an asterisk (*).
The align-dummy-list is optional if the alignee is a scalar; otherwise, the
number of list entries must match the number of dimensions of the alignee.
The same holds true for the align-subscript-list and target. The meaning of
the optional asterisk before the target is explained in Section 3.2.3;
here, it suffices to mention that it only applies to procedure dummy
arguments. The ALIGN directive must appear in the declaration section of
the program.

An ALIGN directive defines the alignment of the alignee by specifying the
element(s) of the target that correspond to each alignee element. Each
align-dummy implicitly ranges over the corresponding dimension of the
alignee. Substituting these values into the target expression specifies the
matching element(s). A * used as a subscript in the target expression means
the alignee element matches all elements in that dimension. For example,
Figure 3.1 shows the result of the HPF directives

!HPF$ ALIGN B(I,J) WITH A(I,J)
!HPF$ ALIGN C(I,J) WITH A(J,I)
!HPF$ ALIGN D(K) WITH A(K,1)
!HPF$ ALIGN E(L) WITH A(L,*)
!HPF$ ALIGN F(M) WITH D(2*M-1)

Elements (squares) with the same symbol are aligned together. Here, B is
identically aligned with A; this is by far the most common case in
practice. Similarly, C is aligned with the transpose of A, which might be
appropriate if one array were accessed row-wise and the other column-wise.
Elements of D are aligned with the first column of A; any other column
could have been used as well. Elements of E, however, are each aligned with
all elements in a row of A. As we will see in Section 3.1.2, this may
result in E being replicated on many processors when A is distributed.
Finally, F has a more complex alignment; through the use of (nontrivial)
linear functions, it is aligned with the odd elements of D. However, D is
itself aligned to A, so F is ultimately aligned with A. Note that, because
the align-subscripts in each directive are linear functions, the overall
alignment relationship is still an invertible linear function.
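
The alignments discussed above can be modelled as functions from an alignee index to the set of target elements of A. The sketch below is an illustration only: the Python function names and the 4-column size used for A are assumptions, and the subscript for F is read as a function of its align-dummy M.

```python
# Each function maps a 1-based alignee index to the set of (row, col)
# elements of A that it is aligned with.

def align_B(i, j):
    return {(i, j)}                     # identity alignment

def align_C(i, j):
    return {(j, i)}                     # aligned with the transpose of A

def align_D(k):
    return {(k, 1)}                     # first column of A

def align_E(l, ncols):
    return {(l, c) for c in range(1, ncols + 1)}   # '*': every column of row l

def align_F(m):
    return align_D(2 * m - 1)           # F -> D -> A: composition stays linear


print(align_F(3))                       # F(3) rides with D(5), hence A(5,1)
print(sorted(align_E(2, ncols=4)))
```

Because F is resolved by composing its subscript with D's, the final relationship is again a single linear map into A.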

Fig. 3.1. Examples of the ALIGN directive.

The ALIGN directive produces rather fine-grain relationships between array
elements. Typically, this corresponds to relationships between the data in
the underlying algorithm or in physical phenomena being simulated. For
example, a fluid dynamics computation might have arrays representing the
pressure, temperature, and fluid velocity at every point in space; because
of their close physical relationships those arrays might well be aligned
together. Because the alignment derives from a deep connection, it tends to
be machine-independent. That is, if two arrays are often accessed together
on one computer, they will also be accessed together on another. This makes
alignment useful for software engineering. A programmer can choose one
"master" array to which all others will be aligned; this effectively copies
the master's distribution (or a modification of it) to the other arrays. To
change the distributions of all the arrays (for example, when porting to a
new machine), the programmer only has to adjust the distribution of the
master array, as explained in the next section.

3.1.2 The DISTRIBUTE Directive. The DISTRIBUTE directive defines
the distribution of an object or template and all other objects aligned to it.
Syntactically, the form of this directive is

!HPF$ DISTRIBUTE distributee [ * ] [ ( dist-format-list ) ] [ ONTO 
[ * ] [ processor [ ( section-list ) ] ] ] 



The distributee can be any object or template, but is typically an array
name; it is the entity being distributed. The dist-format-list gives a
distribution pattern for each dimension of the distributee. The number of
dist-format-list entries must match the number of distributee dimensions.
The ONTO clause identifies the processor arrangement (or, as an HPF 2.0
approved extension, the section thereof) where the distributee will be
stored. The number of dimensions of this expression must match the number
of entries in the dist-format-list that are not *. The dist-format-list or
the processor expression is only optional if its * option (explained in
Section 3.2.3) is present. The DISTRIBUTE directive must appear in the
declaration section of the program.

A DISTRIBUTE directive defines the distribution of the distributee by
giving a general pattern for how each of its dimensions will be divided.
HPF 1.0 had three such formats: BLOCK, CYCLIC, and *. HPF 2.0 adds two
more, GEN_BLOCK and INDIRECT, as approved extensions. Figure 3.2 shows the
results of the HPF directives



!HPF$ DISTRIBUTE S( BLOCK )
!HPF$ DISTRIBUTE T( CYCLIC )
!HPF$ DISTRIBUTE U( CYCLIC(2) )
!HPF$ DISTRIBUTE V( GEN_BLOCK( (/ 3, 5, 5, 3 /) ) )
!HPF$ DISTRIBUTE W( INDIRECT( SNAKE(1:16) ) )
!HPF$ DISTRIBUTE X( BLOCK )
!HPF$ SHADOW X(1)
!HPF$ DISTRIBUTE Y( BLOCK, * )
!HPF$ DISTRIBUTE Z( BLOCK, BLOCK )



Fig. 3.2. Examples of the DISTRIBUTE directive.

Here, the color of an element represents the processor where it is stored.
All of the arrays are mapped onto four processors; for the last case, the
processors are arranged as a 2 × 2 array, although other arrangements
(1 × 4 and 4 × 1) are possible. As shown by the S declaration, BLOCK
distribution breaks the dimension into equal-sized contiguous pieces. (If
the size is not divisible by the number of processors, the block size is
rounded upwards.) The T declaration shows how the CYCLIC distribution
assigns the elements one-by-one to processors in round-robin fashion.
CYCLIC can take an integer parameter k, as shown by the declaration of U;
in this case, blocks of size k are assigned cyclically to the processors.
The declaration of V demonstrates the GEN_BLOCK pattern, which extends the
BLOCK distribution to unequal-sized blocks. The sizes of the blocks on each
processor are given by the mandatory integer array argument; there must be
one such element for each processor. W demonstrates the INDIRECT pattern,
which allows arbitrary distributions to be defined by declaring the
processor home for each element of the distributee. The size of the
INDIRECT parameter array must be the same as the size of the distributee.
The contents of the parameter array SNAKE are not shown in the figure, but
it must be set before the DISTRIBUTE directive takes effect. The BLOCK
distribution can be modified by the SHADOW directive, which allocates
"overlap" areas on each processor. X shows how this produces additional
copies of the edge elements on each processor; the compiler can then use
these copies to optimize data movement. Finally, multi-dimensional arrays
take a distribution pattern in each dimension. For example, the Y array
uses a BLOCK pattern for the rows and the * pattern (which means "do not
partition") for the columns. The Z array displays a true 2-dimensional
distribution, consisting of a BLOCK pattern in rows and columns.
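
The one-dimensional patterns described above reduce to simple index arithmetic. The following Python sketch computes the owning processor (numbered from 0) for each element of a 16-element array on four processors; the function names and the 0-based processor numbering are assumptions made for illustration, and INDIRECT is omitted because the contents of SNAKE are not given.

```python
def owner_block(i, n, nprocs):
    """BLOCK: contiguous pieces, block size rounded upwards."""
    block = -(-n // nprocs)
    return (i - 1) // block

def owner_cyclic(i, nprocs, k=1):
    """CYCLIC(k): blocks of k elements dealt out round-robin (k=1 is CYCLIC)."""
    return ((i - 1) // k) % nprocs

def owner_gen_block(i, sizes):
    """GEN_BLOCK: contiguous pieces whose sizes are listed per processor."""
    upper = 0
    for p, size in enumerate(sizes):
        upper += size
        if i <= upper:
            return p
    raise IndexError("index beyond the total of the block sizes")

n, nprocs = 16, 4
S = [owner_block(i, n, nprocs) for i in range(1, n + 1)]
T = [owner_cyclic(i, nprocs) for i in range(1, n + 1)]
U = [owner_cyclic(i, nprocs, k=2) for i in range(1, n + 1)]
V = [owner_gen_block(i, [3, 5, 5, 3]) for i in range(1, n + 1)]
for name, owners in zip("STUV", (S, T, U, V)):
    print(name, owners)
```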

Fig. 3.3. Combining ALIGN and DISTRIBUTE. Left: DISTRIBUTE A(BLOCK,*);
right: DISTRIBUTE A(*,BLOCK).

When other objects are aligned to the distributee, its distribution
pattern propagates to them. Figure 3.3 shows this process for the
alignments in Figure 3.1. The left side shows the effect of the directive

!HPF$ DISTRIBUTE A( BLOCK, * )

assuming three processors are available. Because the alignments are simple 
in this example, the same patterns could be achieved with the directives 

!HPF$ DISTRIBUTE A( BLOCK, * ) 

!HPF$ DISTRIBUTE B( BLOCK, * ) 

!HPF$ DISTRIBUTE C( *, BLOCK ) 

!HPF$ DISTRIBUTE D( BLOCK ) 

!HPF$ DISTRIBUTE E( BLOCK ) 

!HPF$ DISTRIBUTE F( BLOCK ) 

The right side shows the effect of the directive

!HPF$ DISTRIBUTE A( *, BLOCK ) 

Elements of E are replicated on all processors; therefore, each element has
three colors. This mapping cannot be achieved by the DISTRIBUTE directive
alone. The mappings of D and F onto a single processor are inconvenient to
specify by a single directive, but it is possible using the ONTO clause.
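
The propagation of a distribution through an alignment can be checked numerically. The sketch below assumes a 6 × 6 array A on three processors (the figure's actual array sizes are not stated in the text) and uses hypothetical helper names; it computes the set of processors that hold a copy of an aligned element.

```python
def block_owner(i, n, nprocs):
    """Owner (0-based) of index i in a BLOCK-distributed dimension of size n."""
    return (i - 1) // (-(-n // nprocs))

def owners(targets, n, nprocs, dist):
    """Processors holding the given (row, col) elements of A.
    dist is "BLOCK,*" (rows partitioned) or "*,BLOCK" (columns partitioned)."""
    dim = 0 if dist == "BLOCK,*" else 1
    return {block_owner(t[dim], n, nprocs) for t in targets}

n, nprocs = 6, 3
E2 = {(2, c) for c in range(1, n + 1)}      # E(2) is aligned with A(2,*)
D_all = {(k, 1) for k in range(1, n + 1)}   # every D(K) is aligned with A(K,1)
C_1_5 = {(5, 1)}                            # C(1,5) is aligned with A(5,1)

print(owners(E2, n, nprocs, "BLOCK,*"))     # one owner: E behaves like BLOCK
print(owners(E2, n, nprocs, "*,BLOCK"))     # every processor: E is replicated
print(owners(D_all, n, nprocs, "*,BLOCK"))  # one processor holds all of D
```

Under the column-wise distribution the whole of D lands on the processor that owns the first column of A, which is the single-processor mapping mentioned in the text.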

The DISTRIBUTE directive affects the granularity of parallelism and
communication in a program. The keys to understanding this are to remember
that computation partitioning is based on the location of data, and that
combining elements on different processors (e.g. adding them together, or
assigning one to the other) produces data movement. To balance the
computational load, the programmer must choose distribution patterns so
that the updated elements are evenly spread across processors. If all
elements are updated, then either the BLOCK or CYCLIC patterns do this;
triangular loops sometimes make CYCLIC the only load-balanced option. To
reduce data movement costs, the programmer should maximize the number of
on-processor accesses. BLOCK distributions do this for nearest-neighbor
access patterns; for irregular accesses, it may be best to carefully
calculate an INDIRECT mapping array.
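
The remark about triangular loops can be made concrete with a small count. The sketch below assumes a hypothetical loop nest that updates the lower triangle of a 16 × 16 array (row i has i updated elements) with rows distributed over four processors; the sizes and names are illustrative assumptions.

```python
def triangular_work(n, nprocs, owner):
    """Number of updated elements per processor when row i holds i updates."""
    work = [0] * nprocs
    for i in range(1, n + 1):
        work[owner(i)] += i
    return work

n, nprocs = 16, 4
block = triangular_work(n, nprocs, lambda i: (i - 1) // (-(-n // nprocs)))
cyclic = triangular_work(n, nprocs, lambda i: (i - 1) % nprocs)
print("BLOCK :", block)
print("CYCLIC:", cyclic)
```

With BLOCK the last processor does nearly six times the work of the first, while CYCLIC keeps the per-processor totals within a few rows of each other.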

3.2 Advanced Topics 

The above forms of ALIGN and DISTRIBUTE are adequate for declaring the 
mappings of arrays whose shape and access patterns are static. Unfortunately, 
this is not the case for all arrays; Fortran ALLOCATABLE and POINTER arrays 
do not have a constant size, and subprograms may be called with varying 
actual arguments. The features in this section support more dynamic uses of 
their mapped arrays. 

3.2.1 Specialized and Generalized Distributions. In many dynamic
cases, it is only possible to provide partial information about the data
mapping of an array. For example, it may be known that a BLOCK distribution
was used, but not how many processors (or which processors) were used.
When two mappings interact (for example, when a mapped pointer is
associated with a mapped array), the intuition is that the target must have
a more fully described mapping than the incoming pointer. HPF makes this
intuition more precise by defining "generalizations" and "specializations"
of mappings. In short, S's mapping is a specialization of G's mapping (or
G's mapping is a generalization of S's) if S is more precisely specified.
To make this statement more exact, we must introduce some syntax and
definitions.

The "lone star" syntax that appeared in the DISTRIBUTE directives
indicates any valid value can be used at runtime. Consider the directives

!HPF$ DISTRIBUTE A * ONTO P 

!HPF$ DISTRIBUTE B (BLOCK) ONTO * 

The first line means that A is distributed over processor arrangement P,
but does not specify the pattern; it would be useful for declaring dummy
arguments to a subroutine that would only be executed on those processors.
Similarly, the second line declares that B has a block distribution, but
does not specify the processors. Either clause gives partial information
about the mapping of the distributee. In addition, the INHERIT directive
specifies only that an object has a mapping. This serves as a "wild card"
in matching mappings. It is particularly useful for passing dummy arguments
that can take on any mapping (indeed, the name of the directive comes from
the idea that the dummy arguments "inherit" the mapping of the actuals).

HPF defines the mapping of S to be a specialization of mapping G if:

1. G has the INHERIT attribute, or

2. S does not have the INHERIT attribute, and

   a) S is a named object (i.e. an array or other variable), and

   b) S and G are aligned to objects with the same shape, and

   c) The align-subscripts in S's and G's ALIGN directives reduce to
      identical expressions (except for align-dummy name substitutions), and

   d) Either

      i. Neither S's nor G's align target has a DISTRIBUTE directive, or

      ii. Both S's and G's align targets have DISTRIBUTE directives, and

          A. G's align target DISTRIBUTE directive has no ONTO clause,
             or specifies "ONTO *", or specifies ONTO the same processor
             arrangement as S's align target, and

          B. G's align target DISTRIBUTE directive has no distribution
             format clause, or uses * as the distribution format clause,
             or the distribution patterns in the clause are equivalent
             dimension-by-dimension to the patterns in S's align target.

Two distribution patterns are equivalent in the definition if they both
have the same identifier (e.g. are both BLOCK, both CYCLIC, etc.) and any
parameters have the same values (e.g. the m in CYCLIC(m), or the array ind
in INDIRECT(ind)).

This defines "specialization" as a partial ordering of mappings, with an
INHERIT mapping as its unique minimal element. Conversely, "generalization"
is a partial ordering, with INHERIT as its unique maximum.

We should emphasize that, although the definition of "specialization" is
somewhat complex, the intuition is very simple. One mapping is a
specialization of another if the specialized mapping provides at least as
much information as the general mapping, and when both mappings provide
information they match. For example, the arrays A and B mentioned earlier
in this section have mappings that are specializations of

!HPF$ INHERIT GENERAL 

If all the arrays are the same size, both A and B are generalizations of

!HPF$ DISTRIBUTE SPECIAL (BLOCK) ONTO P

Neither A's nor B's mapping is a specialization of the other.
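
The examples above can be checked against a deliberately reduced model of the definition. The sketch below is an assumption-laden simplification, not the rule from the language specification: it treats every object as a named, directly distributed array of the same shape (so conditions a through c hold trivially), and it represents both "*" and an omitted clause as None. The class and function names are hypothetical.

```python
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    inherit: bool = False
    fmt: Optional[str] = None    # None models "*" or no format clause
    onto: Optional[str] = None   # None models "ONTO *" or no ONTO clause

def is_specialization(s: Mapping, g: Mapping) -> bool:
    """True if s's mapping is a specialization of g's (simplified rule)."""
    if g.inherit:                                   # rule 1
        return True
    if s.inherit:                                   # rule 2 needs S without INHERIT
        return False
    onto_ok = g.onto is None or g.onto == s.onto    # rule 2(d)(ii)(A)
    fmt_ok = g.fmt is None or g.fmt == s.fmt        # rule 2(d)(ii)(B)
    return onto_ok and fmt_ok

A = Mapping(onto="P")                    # DISTRIBUTE A * ONTO P
B = Mapping(fmt="BLOCK")                 # DISTRIBUTE B (BLOCK) ONTO *
GENERAL = Mapping(inherit=True)          # INHERIT GENERAL
SPECIAL = Mapping(fmt="BLOCK", onto="P") # DISTRIBUTE SPECIAL (BLOCK) ONTO P

print(is_specialization(SPECIAL, A), is_specialization(SPECIAL, B))
print(is_specialization(A, B), is_specialization(B, A))
```

In this model A and B are incomparable, which is what makes the ordering partial rather than total.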




High Performance Fortran 2.0 






3.2.2 Data Mapping for ALLOCATABLE and POINTER Arrays. 

One of the great advantages of Fortran 90 and 95 over previous versions
of FORTRAN is dynamic memory management. In particular, ALLOCATABLE
arrays can have their size set during execution, rather than being
statically allocated. This generality does come at some cost in
complicating HPF's data mapping, however. In particular, the number of
elements per processor cannot be computed until the size of the array is
known; this is a particular problem for BLOCK distributions, since the
beginning and ending indices on a particular processor may depend on block
sizes, and thus on the number of processors. In addition, it is unclear
what to do if an allocatable array is used in a chain of ALIGN directives.

HPF resolves these issues with a few simple rules. Mapping directives
(ALIGN and DISTRIBUTE) take effect when an object is allocated, not when
it is declared. If an object is the target of an ALIGN directive, then it
must be allocated when the ALIGN directive takes effect. These rules
enforce most users' expectations; patterns take effect when the array comes
into existence, and one cannot define new mappings based on ghosts. For
example, consider the following code.

REAL, ALLOCATABLE :: A(:), B(:), C(:)
!HPF$ DISTRIBUTE A (CYCLIC (N))
!HPF$ ALIGN B(I) WITH A(I*2)
!HPF$ ALIGN C(I) WITH A(I*2)
ALLOCATE ( B(1000) ) ! Illegal
ALLOCATE ( A(1000) ) ! Legal
ALLOCATE ( C(500) ) ! Legal

The allocation of B is illegal, since it is aligned to an object that does
not (yet) exist. However, the allocations of A and C are properly
sequenced; C can rely on A, which is allocated immediately before it. A
will be distributed cyclically in blocks of N elements (where N is
evaluated on entry to the subprogram where these declarations occur). If N
is even, C will have blocks of size N/2; otherwise, its mapping will be
more complex.
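
The claim about C's block sizes follows from the index arithmetic, and can be checked with a short Python sketch. The processor count of four and the helper names are assumptions for illustration; C(I) is taken to live wherever A(2*I) lives, with A distributed CYCLIC(N).

```python
from itertools import groupby

def owner_cyclic(j, k, nprocs):
    """Owner (0-based) of element j under a CYCLIC(k) distribution."""
    return ((j - 1) // k) % nprocs

def c_run_lengths(n_blk, nprocs, size_c=500):
    """Lengths of the consecutive runs of C elements on the same processor,
    where C(i) is aligned with A(2*i) and A is CYCLIC(n_blk)."""
    owners = [owner_cyclic(2 * i, n_blk, nprocs) for i in range(1, size_c + 1)]
    return [len(list(group)) for _, group in groupby(owners)]

print(c_run_lengths(4, nprocs=4)[:6])   # N even: uniform blocks of N/2
print(c_run_lengths(3, nprocs=4)[:6])   # N odd: block sizes alternate
```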

It is sometimes convenient to choose the distribution based on the actual
allocated size of the array. For example, small problems may use a CYCLIC
distribution to improve load balance while large problems benefit from a
BLOCK distribution. In these cases, the REALIGN and REDISTRIBUTE directives
described in Section 3.2.4 provide the necessary support.

The HPF 2 core language does not allow explicitly mapped pointers, but
the approved extensions do. In this case, the ALLOCATABLE rules also apply
to POINTER arrays. In addition, however, pointer assignment can associate a
POINTER variable with another object, or with an array section. In this
case, the rule is that the mapping of the POINTER must be a generalization
of the target's mapping. For example, consider the following code.

REAL, POINTER :: PTR1(:), PTR2(:), PTR3(:)
REAL, TARGET :: A(1000)
!HPF$ PROCESSORS P(4), Q(8)
!HPF$ INHERIT PTR1
!HPF$ DISTRIBUTE PTR2 (BLOCK)
!HPF$ DISTRIBUTE PTR3 * ONTO P
!HPF$ DISTRIBUTE A (BLOCK) ONTO Q
PTR1 => A( 2 : 1000 : 2 ) ! Legal
PTR2 => A ! Legal
PTR3 => A ! Illegal

PTR1 can point to a regular section of A because it has the INHERIT
attribute; neither of the other pointers can, because regular sections are
not named objects and thus not specializations of any other mapping. PTR2
can point to the whole of A; for pointers, the lack of an ONTO clause
effectively means "can point to data on any processor arrangement." PTR3
cannot point to A because their ONTO clauses are not compatible; the same
would be true if their ONTO clauses matched but their distribution patterns
did not.

The effect of these rules is to enforce a form of type checking on mapped
pointers. In addition to the usual requirements that pointers match their
targets in type, rank, and shape, HPF 2.0 adds the requirement that any
explicit mappings be compatible. That is, a mapped POINTER can only be
associated with an object with the same (perhaps more fully specified)
mapping.

3.2.3 Interprocedural Data Mapping. The basic ALIGN and DISTRIBUTE
directives declare the mapping of global variables and automatic variables.
Dummy arguments, however, require additional capabilities. In particular,
the following situations are possible in HPF:

Prescriptive mapping: The dummy argument can be forced to have a
particular mapping. If the actual argument does not have this mapping, the
mapping is changed for the duration of the procedure.

Descriptive mapping: This is the same as a prescriptive mapping, except
that it adds an assertion that the actual argument has the same mapping
as the dummy. If it does not, the compiler may emit a warning.

Transcriptive mapping: The dummy argument can have any mapping. In
effect, it inherits the mapping of the actual argument.

Syntactically, prescriptive mapping is expressed by the usual ALIGN or
DISTRIBUTE directives, as described in Section 3.1. Descriptive mapping is
expressed by an asterisk preceding a clause of an ALIGN or DISTRIBUTE
directive. Transcriptive mappings are expressed by an asterisk in place of
a clause in an ALIGN or DISTRIBUTE directive. It is possible for a mapping
to be partially descriptive and partially transcriptive (or some other
combination) by this definition, but such uses are rare. For example,
consider the following subroutine header.

SUBROUTINE EXAMPLE ( PRE, DESC1, DESC2, TRANS1, TRANS2, N )
INTEGER N
REAL PRE(N), DESC1(N), DESC2(N), TRANS1(N), TRANS2(N)
!HPF$ DISTRIBUTE PRE(BLOCK) ! Prescriptive
!HPF$ DISTRIBUTE DESC1 * (BLOCK) ! Descriptive
!HPF$ ALIGN DESC2(I) WITH *DESC1(I*2-1) ! Descriptive
!HPF$ INHERIT TRANS1 ! Transcriptive
!HPF$ DISTRIBUTE TRANS2 * ONTO * ! Transcriptive

PRE is prescriptively mapped; if the corresponding actual argument does not
have a BLOCK distribution, then the data will be remapped on entry to the
subroutine and on exit. DESC1 and DESC2 are descriptively mapped; the
actual argument for DESC1 is expected to be BLOCK-distributed, and the
actual for DESC2 should be aligned with the odd elements of DESC1. If
either of these conditions is not met, then a remapping is performed and a
warning is emitted by the compiler. TRANS1 is transcriptively mapped; the
corresponding actual can have any mapping or can be an array section
without causing remapping. TRANS2 is also transcriptively mapped, but
passing an array section may cause remapping.
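
The three kinds of dummy-argument mapping differ only in what happens when the actual argument's mapping does not match. A minimal decision-table sketch in Python follows; the function name and return convention are assumptions, and the transcriptive case models only the INHERIT form (the "DISTRIBUTE ... * ONTO *" form may still remap array sections, as noted above).

```python
def call_effect(kind, actual_matches):
    """Return (remap, may_warn) for one dummy argument at a call.
    kind is "prescriptive", "descriptive", or "transcriptive"."""
    if kind not in ("prescriptive", "descriptive", "transcriptive"):
        raise ValueError("unknown mapping kind: " + kind)
    if kind == "transcriptive":
        return (False, False)        # the dummy takes on the actual's mapping
    if actual_matches:
        return (False, False)        # nothing to do
    # Mismatch: data is remapped; only a descriptive mapping asserted a match,
    # so only that case may draw a compiler warning.
    return (True, kind == "descriptive")

for kind in ("prescriptive", "descriptive", "transcriptive"):
    print(kind, call_effect(kind, actual_matches=False))
```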

HPF 2.0 simplified and clarified the rules for when an explicit interface
is required.^ If any argument is declared with the INHERIT directive, or if
any argument is remapped when the call is executed, then an explicit
interface is required. Remapping is considered to occur if the mapping of
an actual argument is not a specialization of the mapping of its
corresponding dummy argument. In other words, if the dummy's mapping is
INHERIT or doesn't describe the actual as well (perhaps in less detail),
then the programmer must supply an explicit interface. The purpose of this
rule is to ensure that both caller and callee have the information required
to perform the remapping. Because it is sometimes difficult to decide
whether mappings are specializations of each other, some programmers prefer
to simply use explicit interfaces for all calls: this is certainly safe.

3.2.4 REALIGN and REDISTRIBUTE. Sometimes, remapping of arrays is
required at a granularity other than subroutine boundaries. For example,
within a single procedure an array may exhibit parallelism across rows
for several loops, then parallelism across columns. Another common example
is choosing the distribution of an array based on runtime analysis, such as
computing the parameter array for a later INDIRECT distribution. For these
cases, HPF 2.0 approved extensions provide the REALIGN and REDISTRIBUTE
directives. It is worth noting that these directives were part of HPF
version 1.0, but reclassified as approved extensions in HPF 2.0 due to
unforeseen difficulties in their implementation.



^ Fortran 90 introduced the concept of explicit interfaces, which give the caller
all information about the types of dummy arguments. Explicit interfaces are
created by INTERFACE blocks and other mechanisms.

Syntactically, the REALIGN directive is identical to the ALIGN directive,
except for two additional characters in the keyword. (Also, REALIGN does not
require the descriptive and transcriptive forms of ALIGN since its purpose is
always to change the data mapping.) Similarly, REDISTRIBUTE has the same
syntax as DISTRIBUTE's prescriptive case. Semantically, both directives set
the mapping for the arrays they name when the program control flow reaches
them; in a sense, they act like executable statements in this regard. The new
mapping will persist until the array becomes deallocated, or another REALIGN
or REDISTRIBUTE directive is executed. Data in the remapped arrays must be
communicated to its new home unless the compiler can detect that the data
is not live. One special case of dead data is called out in the HPF language
specification: a REALIGN or REDISTRIBUTE directive for an array immediately
following an ALLOCATE statement for the same array. The HPFF felt this was
such an obvious and common case that strong advice was given to the vendors
to avoid data motion there.

There is one asymmetry between REALIGN and REDISTRIBUTE that bears
mentioning. REALIGN only changes the mapping of its alignee; the new
mapping does not propagate to any other arrays that might be aligned with
it beforehand. REDISTRIBUTE of an array changes the mapping for the
distributee and all arrays that are aligned to it, following the usual
rules for ALIGN. The justification for this behavior is that both "remap
all" and "remap one" behaviors are needed in different algorithms. (The
choice to make REDISTRIBUTE rather than REALIGN propagate to other arrays
was somewhat arbitrary, but fit naturally with the detailed definitions of
alignment and distribution in the language specification.)

The examples in Figure 3.4 may be helpful. In the assignments to A, that
array is first computed from corresponding elements of C and their
vertical neighbors, then updated from C's transpose. Clearly, the
communication patterns are different in these two operations; use of
REALIGN allows both assignments to be completed without communication.
(Although the communication here occurs in the REALIGN directives instead,
a longer program could easily show a net reduction in communication.) In
the operations on B, corresponding elements of B and D are multiplied in
both loops; this implies that the two arrays should remain identically
aligned. However, each loop only exhibits one dimension of parallelism;
using REDISTRIBUTE as shown permits the vector operations to be executed
fully in parallel in each loop, while any static distribution would
sacrifice parallel execution in one or the other.



REAL A(100,100), B(100,100), C(100,100), D(100,100)
!HPF$ DYNAMIC A, B
!HPF$ DISTRIBUTE C(BLOCK,*)
!HPF$ ALIGN D(I,J) WITH B(I,J)

!HPF$ REALIGN A(I,J) WITH C(I,J)
A = C + CSHIFT(C,1,2) + CSHIFT(C,-1,2)
!HPF$ REALIGN A(I,J) WITH C(J,I)
A = A + TRANSPOSE(C)

!HPF$ REDISTRIBUTE B(*,BLOCK)
DO I = 2, N-1
  B(I,:) = B(I-1,:)*D(I-1,:) + B(I,:)
END DO
!HPF$ REDISTRIBUTE B(BLOCK,*)
DO J = 2, N-1
  B(:,J) = B(:,J-1)*D(:,J-1) + B(:,J-1)
END DO

Fig. 3.4. REALIGN and REDISTRIBUTE



4. Data Parallelism

Although the data mapping features of HPF are vital, particularly on
distributed memory architectures, they must work in concert with data-parallel
operations to fully use the machine. Conceptually, data-parallel loops and
functions identify masses of operations that can be executed in parallel,
typically element-wise updates of data. The compiler and run-time system use the
data mapping information to package this potential parallelism for the
physical machine. Therefore, as a first-order approximation, programmers should
expect that optimal performance will occur for vector operations along a
distributed dimension. This will be true modulo communication and
synchronization costs, and possibly implementation shortcomings.

4.1 Basic Language Features 

Four features make up the basic HPF support for data parallelism: 

1. Fortran 90 array assignments define element-wise operations on regular
arrays.

2. The FORALL statement is a new form of array assignment that provides
greater flexibility.

3. The INDEPENDENT assertion is a directive (i.e. structured comment) that 
gives the compiler more information about a DO loop. 

4. The HPF library is a set of useful functions that perform parallel opera- 
tions on arrays. 

All of these features are part of the HPF 2.0 core language; Section 4.2 will 
consider additional topics from the approved extensions. We will not cover 
array assignments in more detail, except to say that they formed an invaluable
base to build HPF's more general operations. FORALL is the first case that we
have discussed of a new language construct; in contrast to the data mapping
directives, it changes the values of program data. Because of this, it could
not be another directive. INDEPENDENT, on the other hand, is a directive; if
correctly used, it only provides information to the compiler and does not alter
the meaning of the program. Finally, the HPF library is a set of functions
that provide interesting parallel operations. In most cases, these operations
derive their parallelism from independent operations on large data sets, but
the operations do not occur element-wise as in array assignments.

Compilers may augment the explicit data-parallel features of HPF by analyzing
DO loops and other constructs for parallelism. In fact, many do precisely
that. Doing so is certainly a great advantage for users who are porting
existing code. Unfortunately, different compilers have different capabilities in this
regard. For portable parallelism, it is often best to use the explicit constructs 
described below. 

4.1.1 The FORALL Statement. The FORALL statement provides data- 
parallel operations, much like array assignments, with an explicit index space, 
much like a DO loop. There are both single-statement and multi-statement 
forms of the FORALL. The single-statement form has the syntax 

FORALL ( forall-triplet-list [, mask-expr ] ) forall-assignment-stmt

The multi-statement form has the syntax

FORALL ( forall-triplet-list [, mask-expr ] )
   [ forall-body-construct ] ...
END FORALL

In both cases, a forall-triplet is

index-name = int-expr : int-expr [ : int-expr ]

If a forall-triplet-list has more than one triplet, then no index-name may
be used in the bounds or stride for any other index. A forall-assignment-stmt
is either an assignment statement or a pointer assignment statement.
A forall-body-construct can be an assignment, pointer assignment, WHERE, or
FORALL statement. For both forall-assignment-stmt and forall-body-construct,
function calls are restricted to PURE functions; as we will see in Section 4.1.2,
these are guaranteed to have no side effects.

The semantics of a single-statement FORALL is essentially the same as 
for a single array assignment. First, the bounds and strides in the FORALL 
header are evaluated. These determine the range for each index to iterate
over; for multidimensional FORALL statements, the indices are combined by
Cartesian product. Next, the mask expression is evaluated for every
"iteration" in range. The FORALL body will not be executed for iterations that
produce a false mask. The right-hand side of the assignment is computed for
every remaining iteration. The key to parallel execution of the FORALL is that
these computations can be performed in parallel: no data is modified at this
point, so there can be no interference between different iterations. Finally,
all the results are assigned to the left-hand sides. It is an error if two of the
iterations produce the same left-hand side location. Absent that error, the
assignments can be made in parallel since there are no other possible sources 
of interference. 
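
The steps above can be spelled out by hand. The following is a minimal sketch in standard Fortran 95, not taken from the HPF specification; the array names, sizes, and values are illustrative assumptions. It runs a masked FORALL on A and then emulates it on a copy B with explicit phases: evaluate the mask, compute every right-hand side into a temporary, and only then assign.

```fortran
program forall_semantics
  implicit none
  integer, parameter :: n = 6
  integer :: a(n), b(n), tmp(n), i
  logical :: active(n)

  a = (/ 1, -2, 3, 4, -5, 6 /)    ! illustrative data (assumption)
  b = a

  ! Masked single-statement FORALL: only positive elements are updated,
  ! and every right-hand side sees the values from before the statement.
  forall (i = 2:n, a(i) > 0) a(i) = a(i-1) + a(i)

  ! Hand-written equivalent on b, one phase per step in the text.
  do i = 2, n                     ! phase 1: evaluate the mask
    active(i) = b(i) > 0
  end do
  do i = 2, n                     ! phase 2: all right-hand sides (b is only read)
    if (active(i)) tmp(i) = b(i-1) + b(i)
  end do
  do i = 2, n                     ! phase 3: all assignments
    if (active(i)) b(i) = tmp(i)
  end do

  print '(6i4)', a
  print '(6i4)', b
  if (any(a /= b)) stop 1         ! the two versions must agree
end program forall_semantics
```

With this data, element 4 becomes 3 + 4 = 7 because it reads the old value of element 3; a sequential DO loop with an IF would instead read the already updated element 3.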

The semantics of a multi-statement FORALL reduce to a series of
single-statement FORALLs, one per statement in the body. That is, the bounds and
mask are computed once at the beginning of execution. Then each statement
is executed in turn, first computing all right-hand sides and then assigning to
the left-hand sides. After all assignments are complete, execution moves on
to the next body statement. If the body statement is another FORALL, then
the inner bounds must be computed (and may be different for every outer
iteration) before the inner right-hand and left-hand sides, but execution still
proceeds one statement at a time. The situation is similar for the mask in a 
nested WHERE statement. 

One way to visualize this process is shown in Figure 4.1. The diagram to
the left of the figure illustrates the data dependences possible in the example
code

FORALL ( I = 2:4 )
  A(I) = A(I-1) + A(I+1)
  C(I) = B(I) * A(I+1)
END FORALL

Fig. 4.1. Visualizations of FORALL and DO

In the figure, each elemental operation is shown as an oval. The first row
represents the A(I-1)+A(I+1) computations; the second row represents the
assignments to A(I); the third row represents the computation of
B(I)*A(I+1); the bottom row represents the assignments to C(I). The numbers in
the ovals are based on the initial values

A(1:5) = (/ 0, 1, 2, 3, 4 /)
B(1:5) = (/ 0, 10, 20, 30, 40 /)

The reader can verify that the semantics above lead to a final result of

A(2:4) = (/ 2, 4, 6 /)
C(2:4) = (/ 40, 120, 120 /)

Arrows in the diagram represent worst-case data dependences; that is, a 
FORALL could have any one of those dependences. (To simplify the picture,
transitive dependences are not shown.) These dependences arise in two ways: 

From right-hand side to left-hand side: the left-hand side may overwrite 
data needed to compute the right-hand side. 

From left-hand side to right-hand side: the right-hand side may use data 
assigned by the left-hand side. 

By inspection of the diagram, it is clear that there are no connections running 
across rows. This is true in general, and indicates that it is always legal to 
execute all right-hand and all left-hand sides simultaneously. Also, it appears 
from the diagram that a global synchronization is needed between adjacent 
rows. This is also true in general, but represents an uncommon worst case. 

In the diagram, dark arrows represent dependences that actually arise in this
example; light arrows are worst-case dependences that do not occur here. It
is often the case, as here, that many worst-case dependences do not occur
in a particular case; a good compiler will detect these cases and simplify
communication and synchronization accordingly. In this case, for example,
there is no need for synchronization during execution of the second statement.

It is useful to contrast the FORALL diagram with the corresponding dependence
structure for an ordinary DO loop with the same body. The DO is shown
on the right side of Figure 4.1. One can immediately see that the dependence
structures are different, and verify that the result of the DO loop has changed
to



A(2:4) = (/ 2, 5, 9 /) 
C(2:4) = (/ 20, 60, 120 /)

At first glance, the diagram appears simpler. However, the dependence arcs
from the bottom of each iteration to the top eliminate parallelism in the
general case. Of course, there are many cases where these serializing dependences
do not occur; parallelizing and vectorizing compilers work by detecting those
cases. However, if the analysis fails the DO will run sequentially, while the
FORALL can run in parallel. 
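
Both sets of results can be checked mechanically. The sketch below, written in standard Fortran 95 with a program name of our own choosing, runs the same two-statement body once as a FORALL construct and once as a DO loop, starting from the initial values given in the text.

```fortran
program forall_vs_do
  implicit none
  integer :: a(5), b(5), c(5), i

  b = (/ 0, 10, 20, 30, 40 /)

  ! FORALL: the first statement completes for every I before the
  ! second statement starts.
  a = (/ 0, 1, 2, 3, 4 /)
  c = 0
  forall (i = 2:4)
    a(i) = a(i-1) + a(i+1)
    c(i) = b(i) * a(i+1)
  end forall
  print '(a,3i5)', 'FORALL A(2:4):', a(2:4)   ! text gives 2, 4, 6
  print '(a,3i5)', 'FORALL C(2:4):', c(2:4)   ! text gives 40, 120, 120

  ! DO: both statements run for I=2, then both for I=3, and so on,
  ! so later iterations see values written by earlier ones.
  a = (/ 0, 1, 2, 3, 4 /)
  c = 0
  do i = 2, 4
    a(i) = a(i-1) + a(i+1)
    c(i) = b(i) * a(i+1)
  end do
  print '(a,3i5)', 'DO     A(2:4):', a(2:4)   ! text gives 2, 5, 9
  print '(a,3i5)', 'DO     C(2:4):', c(2:4)   ! text gives 20, 60, 120
end program forall_vs_do
```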

The example above could also be written using array assignments. 

A(2:4) = A(1:3) + A(3:5)

C(2:4) = B(2:4) * A(3:5) 



However, FORALL can also access and assign to more complex array regions 
than can easily be expressed by array assignment. For example, consider the 
following FORALL statements. 



FORALL ( I = 1:N ) A(I,I) = B(I,N-I+1)
FORALL ( J = 1:N ) C(INDX(J),J) = J*J
FORALL ( K = 1:N )
  FORALL ( L = 1:K ) D(K,L) = E( K*(K-1)/2 + L )
END FORALL

The first statement accesses the anti-diagonal of B and assigns it to the main
diagonal of A. The anti-diagonal could be accessed using advanced features of
the array syntax; there is no way to do an array assignment to a diagonal
or other non-rectangular region. The second statement does a computation
using the values of the index and assigns them to an irregular region of the 
array C. Again, the right-hand side could be done using array syntax by 
creating a new array, but the left-hand side is too irregular to express by 
regular sections. Finally, the last FORALL nest unpacks the one-dimensional 
E array into the lower triangular region of the D array. Neither the left nor 
right-hand side can easily be expressed in array syntax alone. 
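
The index arithmetic in the triangular unpacking is easy to get wrong, so a small check may help. The sketch below is standard Fortran 95; the size N = 4 and the contents of E are assumptions chosen so that the mapping K*(K-1)/2 + L is visible in the output.

```fortran
program unpack_triangle
  implicit none
  integer, parameter :: n = 4
  integer :: d(n,n), e(n*(n+1)/2), k, l, m

  e = (/ (m, m = 1, n*(n+1)/2) /)   ! packed data 1, 2, ..., 10 (assumption)
  d = 0

  ! Row K of the lower triangle holds K elements, stored in E directly
  ! after the K*(K-1)/2 elements of rows 1 to K-1.
  forall (k = 1:n)
    forall (l = 1:k) d(k,l) = e(k*(k-1)/2 + l)
  end forall

  do k = 1, n
    print '(4i4)', d(k,:)
  end do
end program unpack_triangle
```

Row 1 of the output holds element 1, row 2 holds elements 2 and 3, row 3 holds 4 to 6, and row 4 holds 7 to 10; everything above the diagonal stays zero.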

4.1.2 PURE Functions. FORALL provides a form of "parallel loop" over the
elements of an array, but the only statements allowed in the loop body are
various forms of assignment. Many applications benefit from more complex
operations on each element, such as point-wise iteration to convergence. HPF
addresses these needs by allowing a class of functions, the PURE functions,
to be called in FORALL assignments. PURE functions are safe to call from within
FORALL because they cannot have side effects; therefore, they cannot create
new dependences in the FORALL statement's execution. It is useful to call
PURE functions because they allow very complex operations to be performed,
including internal control flow that cannot easily be expressed directly in the
body of a FORALL.

Syntactically, a PURE function is declared by adding the keyword PURE to
its interface before the function's type. PURE functions must have an explicit
interface, so this declaration is visible to both the caller and the function
itself. The more interesting syntactic issue is what can be included in the
PURE function. The HPF 2.0 specification has a long list of restrictions on
statements that can be included. The simple statement of these restrictions
is that any construct that could assign to a global variable or a dummy
argument is not allowed. This includes obvious cases such as using a global
variable on the left-hand side of an assignment and less obvious ones such as
using a dummy argument as the target of a pointer assignment. (The latter
case does not directly change the dummy, but allows later uncontrolled side
effects through the pointer.) Of course, this list of restrictions leads directly
to the desired lack of side effects in the function. It is important to note that
there are no restrictions on control flow in the function, except that STOP
and PAUSE statements are not allowed. This allows quite complex iterative
algorithms to be implemented in PURE functions.



! The caller
FORALL ( I=1:N, J=1:M )
  K(I,J) = MANDELBROT( CMPLX( (I-1)*1.0/(N-1), &
                              (J-1)*1.0/(M-1) ), 1000 )
END FORALL

! The callee
PURE INTEGER FUNCTION MANDELBROT(X, ITOL)
  COMPLEX, INTENT(IN) :: X
  INTEGER, INTENT(IN) :: ITOL
  COMPLEX XTMP
  INTEGER K
  K = 0
  XTMP = -X
  DO WHILE (ABS(XTMP) < 2.0 .AND. K < ITOL)
    XTMP = XTMP*XTMP - X
    K = K + 1
  END DO
  MANDELBROT = K
END FUNCTION MANDELBROT



Fig. 4.2. Mandelbrot set computation by a PURE function 



A short example illustrates both the last point about control flow and
hints at PURE functions' power. Consider the code in Figure 4.2. This code
(which was one of the first HPF programs publicly demonstrated) creates
a picture of the Mandelbrot set. Notice that the MANDELBROT function uses 
local variables to iterate on every point it is passed. 
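
The same pattern applies to the "point-wise iteration to convergence" mentioned earlier. The following self-contained sketch in standard Fortran 95 is our own illustration, not code from the HPF specification: NEWTON_SQRT, its tolerance argument, and the iteration cap are hypothetical names and choices. The function is placed in a module so that the caller sees the explicit interface that PURE functions require.

```fortran
module newton_mod
  implicit none
contains
  ! Newton iteration for the square root of a positive X.
  ! Only local variables are assigned, so the function can be PURE.
  pure function newton_sqrt(x, tol) result(r)
    real, intent(in) :: x, tol
    real :: r, rnew
    integer :: it
    r = x
    do it = 1, 100                      ! cap guarantees termination
      rnew = 0.5 * (r + x / r)
      if (abs(rnew - r) <= tol * rnew) exit
      r = rnew
    end do
    r = rnew
  end function newton_sqrt
end module newton_mod

program pure_demo
  use newton_mod
  implicit none
  real :: a(4), s(4)
  integer :: i
  a = (/ 1.0, 4.0, 9.0, 2.0 /)
  ! Each element may need a different number of iterations; the
  ! data-dependent loop lives inside the PURE function.
  forall (i = 1:4) s(i) = newton_sqrt(a(i), 1.0e-6)
  print '(4f10.5)', s
end program pure_demo
```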

4.1.3 The INDEPENDENT Assertion. Array assignment and FORALL 
are new statements with new semantics that are convenient for programmers 
and efficiently implementable. The purpose of the INDEPENDENT directive is
somewhat different: it gives the compiler new information about a standard
DO loop. In particular, INDEPENDENT tells the system that serializing
worst-case behavior does not occur; this allows the compiler to run the loop in
parallel. Compilers on serial machines can use the same information in other
ways, for example to manage cache. Thus, INDEPENDENT is not really a
"parallel loop" (although it can be used that way); it is information about the
program that many systems find useful.

The syntax of INDEPENDENT is

!HPF$ INDEPENDENT [ , LOCAL ( variable-list ) ] [ , REDUCTION ( variable-list ) ]

The directive must immediately precede a DO statement or FORALL statement, 
and describes the behavior of only that statement. 

When HPF 1.0 was under development, it rapidly became clear that
different people had different definitions of "can be executed in parallel." In the
end, HPFF chose a fairly restrictive, mathematical definition of INDEPENDENT.

A loop is INDEPENDENT if no iteration interferes with any other iteration. If
there is no LOCAL or REDUCTION clause, two iterations interfere if any of the
following occur: 

Both iterations set the value of the same atomic object. (An atomic object 
is a Fortran object that does not contain another object. For example, an 
integer is an atomic object; an array of integers is an object that is not 
atomic.) 

One iteration sets the value of an atomic object, and the other uses the 
value of the same object. 

One iteration allocates or deallocates an object that is set or used by the 
other iteration. 

One iteration remaps an array that is set or used by another iteration.

Both iterations perform I/O on the same file.

One iteration exits the loop by GOTO or other transfer of control, or stops 
execution by STOP or PAUSE. 

Formally, LOCAL is an assertion that no values flow into the named variables
from before the iteration nor flow from the variables after the iteration. This
implies that the variables can be removed from consideration for causing
interference; in effect, it creates variables that are local to each iteration. (Note
that on shared-memory machines this forces each processor to have a
separate copy of the LOCAL variables.) The REDUCTION clause asserts that the
named variables are updated by associative and commutative intrinsic
operations within the lexical body of the loop, and that the variable is not used
elsewhere in the loop. Although this is formally a type of interference, it is a
well-structured type with efficient parallel implementations. (Note that extra
data copies are often needed in these implementations as well.) Because of
the lack of interference, an INDEPENDENT loop can always in principle be
executed with a single synchronization point at the beginning and another at the
end. Contrast this with the row-by-row synchronizations needed for FORALL.
Less obviously, all data needed for an iteration is available either before the 
loop begins or is generated within the iteration itself. This information can 
be used to optimize data movement. 
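
A small loop shows where the two clauses apply. This is a sketch of our own with assumed variable names; a standard Fortran compiler treats the !HPF$ line as a comment, so the program also runs sequentially.

```fortran
program local_reduction
  implicit none
  integer, parameter :: n = 8
  real :: x(n), y(n), tmp, total
  integer :: i

  x = (/ (real(i), i = 1, n) /)
  total = 0.0

  ! TMP is written before it is read in every iteration and is not
  ! needed after the loop, so it can be declared LOCAL.
  ! TOTAL is updated only by "+" and not otherwise used in the loop,
  ! so it qualifies as a REDUCTION variable.
!HPF$ INDEPENDENT, LOCAL(tmp), REDUCTION(total)
  do i = 1, n
    tmp = 2.0 * x(i)
    y(i) = tmp + 1.0
    total = total + tmp
  end do

  print '(f8.1)', total    ! 2*(1+2+...+8) = 72.0
end program local_reduction
```

Without the two clauses the loop would not be INDEPENDENT by the definition above, because every iteration sets both TMP and TOTAL.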

Two things should be noted about the definition of interference.

1. Behavior, not syntax, causes interference. For example, a READ statement
from a file into an element of a global array is perfectly allowable inside
an INDEPENDENT loop, so long as it is only executed in one iteration (or the
file and array element change on every iteration). This is in contrast to the
FORALL statement, which limits its body to various forms of assignment
statements.

2. Interference invalidates independence, even if it wouldn't matter. The
REDUCTION clause was added to HPF 2.0 because experience with HPF 1
showed that users did not consider reductions to be "real" interference.
It is still the case in HPF 2, however, that nondeterministic algorithms
are not INDEPENDENT: although any answer from the algorithm might
be acceptable in principle, the fact that different answers are possible
indicates there was a write-write interference.

This combination provides a great deal of freedom for basic parallelization, 
but does not support complex synchronization patterns or some advanced 
algorithms. 

Figure 4.3 gives the data dependence diagram for an INDEPENDENT DO 
loop. The ovals and arrows have the same meaning as in Figure 4.1, but the 
example is now 

!HPF$ INDEPENDENT 
DO J = 1, 3 

A(J) = A(B(J)) 

C(A(J)) = A(J)*B(A(J)) 

END DO 

Numbers in the diagram correspond to the initial data 



A(1:8) = (/ 0, 2, 4, 6, 1, 3, 5, 7 /)
B(1:8) = (/ 6, 5, 4, 3, 2, 3, 4, 5 /)
C(1:8) = (/ -1, -1, -1, -1, -1, -1, -1, -1 /)



The reader can verify that the final result will be

A(1:8) = (/ 3, 1, 6, 6, 1, 3, 5, 7 /)
B(1:8) = (/ 6, 5, 4, 3, 2, 3, 4, 5 /)
C(1:8) = (/ 6, -1, 12, -1, -1, 18, -1, -1 /)



Fig. 4.3. Visualization of INDEPENDENT loop

The key point to notice is that all dependences between iterations have been
severed; the worst case of an INDEPENDENT loop is the best case that FORALL
statements had to be analyzed for in Section 4.1.1. Notice that, since the
INDEPENDENT assertion is correct, these are the same answers that would
have been obtained by an ordinary DO loop. However, INDEPENDENT allows
the loop to execute in parallel.
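
One informal way to gain confidence in such an assertion is to run the loop in more than one iteration order and compare. The sketch below (standard Fortran 95, our own test harness) executes the example loop forward and backward on the initial data above. Agreement between two orders is only a necessary condition for independence, not a proof.

```fortran
program independent_check
  implicit none
  integer :: a(8), b(8), c(8), a2(8), c2(8), j

  b = (/ 6, 5, 4, 3, 2, 3, 4, 5 /)

  a = (/ 0, 2, 4, 6, 1, 3, 5, 7 /)
  c = -1
  do j = 1, 3                        ! ordinary sequential order
    a(j) = a(b(j))
    c(a(j)) = a(j) * b(a(j))
  end do

  a2 = (/ 0, 2, 4, 6, 1, 3, 5, 7 /)
  c2 = -1
  do j = 3, 1, -1                    ! reversed order
    a2(j) = a2(b(j))
    c2(a2(j)) = a2(j) * b(a2(j))
  end do

  print '(8i4)', a                   ! text gives 3 1 6 6 1 3 5 7
  print '(8i4)', c                   ! text gives 6 -1 12 -1 -1 18 -1 -1
  if (any(a /= a2) .or. any(c /= c2)) stop 1
end program independent_check
```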

The loop in the preceding paragraph also illustrates why INDEPENDENT is 
necessary even for advanced compilers. Although static analysis of programs 
has advanced far, no compiler could have detected independence in that loop. 
This is because the independence came from the data values which would 
not be available at compile time. Not knowing the values in the B array, 
the compiler would have no choice but to make the worst-case serializing 
assumptions. Although this example is somewhat contrived, it is similar to 
the application- and algorithm-level information that is available for many 
programs. It is often the case that an algorithm guarantees that a graph
is acyclic by construction, or that an index vector is a permutation, or that 
the data structure has some other property. Application-level knowledge may 
also indicate properties of the input with similar implications. These high- 
level facts are generally not obvious by inspection of the program (indeed, 
they are often publishable research results in their own right). However, pro- 
grammers often are aware of such properties and realize their implications 
for parallelism. INDEPENDENT provides a mechanism to pass this knowledge to 
the compiler in a form that it can use. Because the directive is phrased as an
assertion, the compiler can further propagate the information to parallelize
other loops with the same access pattern. 

4.1.4 The HPF Library. In addition to elemental operations, several common
operations have parallelism that scales with the size of the data structure.
Reductions, as we just saw, are a frequent example. Sorting can also be
done in parallel. However, the optimal algorithm for even simple operations
such as reductions varies widely from architecture to architecture. Moreover,
the best algorithms may be rather intricate and difficult to express in terms
of FORALL statements or INDEPENDENT loops. Due to the importance of these
high-level operations, however, it is important to provide them to programmers.
HPF 2.0 does this via the HPF library, a collection of useful functions.
In effect, this library puts the onus on the compiler and system vendors to
provide efficient libraries for their users.

Due to space limitations, we cannot list all 57 functions and subroutines
in the HPF library. Instead, we discuss the general classes of functionality.

System inquiry intrinsic functions return information about the state of the
parallel machine. There are two of these functions: NUMBER_OF_PROCESSORS,
which returns the number of available processors, and PROCESSORS_SHAPE,
which returns the "natural" processors arrangement. They can be used to
adapt programs to the machine, for example by sizing arrays to an even 
multiple of the number of processors. 

Mapping inquiry subroutines return the mapping of an object (typically 
an array). There are three of them: HPF_ALIGNMENT, HPF_DISTRIBUTION, 
and HPF_TEMPLATE. They are particularly useful for checking transcriptive 
dummy arguments or dynamically mapped arrays to ensure that an appro- 
priate algorithm is used for each data mapping. 

Bit manipulation functions operate on the bits of an integer representation. 
These four functions are useful for data compression, cryptography, and 
other bitwise algorithms. 

Array reduction functions use associative and commutative functions to 
combine elements of an array. Fortran 90 introduced several such functions,
including SUM, PRODUCT and MAXVAL, but did not provide a reduction
for every built-in associative and commutative operator. HPF completes
the set by adding four functions: IALL (bitwise AND), IANY (bitwise OR),
PARITY (logical exclusive OR), and IPARITY (bitwise exclusive OR).

Array prefix and suffix functions (sometimes called scan functions) compute
sets of partial reductions of arrays. The first element of a prefix sum
is the first element of the input; the second output element is the sum of
the first two input elements; and in general the Nth output element is the
sum of the first N elements. HPF provides a prefix and a suffix function
for each built-in associative and commutative operation, named after the
corresponding reduction (for example, SUM_PREFIX and SUM_SUFFIX). The
utility functions COPY_PREFIX and COPY_SUFFIX are also available. All of
these functions include options for masking elements and performing
segmented scan operations. There are 24 of these functions in all. Work at
Carnegie Mellon University has shown how these operations can be used
for a variety of irregular data parallel algorithms. 

Array combining scatter functions scatter their input, resolving collisions 
by an associative and commutative operation. For example, the sum com- 
bining scatter could be implemented sequentially by 

A = 0
DO I = 1, N
  A(INDX(I)) = A(INDX(I)) + B(I)
END DO



There is a combining scatter function for each built-in associative and com- 
mutative operator plus the COPY operation, giving a total of 12 functions. 
These functions are also useful in irregular applications; they also appear
naturally in constructing the matrices for finite element methods.

Array sorting functions either permute an array into monotonic order, or
return a permutation vector that will do the job. The functions that return
a sorted array are SORT_UP and SORT_DOWN; those that return a permutation
vector are GRADE_UP and GRADE_DOWN.
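
To make the prefix and segmented-scan definitions above concrete, here is a sequential reference sketch in standard Fortran. It does not call the HPF library, which most current compilers do not ship. The STARTS flag array marking where each segment begins is a hypothetical convention chosen for readability; the HPF library's own SEGMENT argument has its own encoding, which is not reproduced here.

```fortran
program prefix_demo
  implicit none
  integer, parameter :: n = 6
  integer :: x(n), p(n), s(n), i
  logical :: starts(n)

  x = (/ 1, 2, 3, 4, 5, 6 /)
  ! Two segments: elements 1 to 3 and elements 4 to 6 (assumption).
  starts = (/ .true., .false., .false., .true., .false., .false. /)

  ! Inclusive prefix sum: p(i) = x(1) + ... + x(i).
  p(1) = x(1)
  do i = 2, n
    p(i) = p(i-1) + x(i)
  end do

  ! Segmented prefix sum: the running sum restarts at each segment start.
  s(1) = x(1)
  do i = 2, n
    if (starts(i)) then
      s(i) = x(i)
    else
      s(i) = s(i-1) + x(i)
    end if
  end do

  print '(6i4)', p     ! 1 3 6 10 15 21
  print '(6i4)', s     ! 1 3 6 4 9 15
end program prefix_demo
```

The loops here are inherently sequential; the point of providing scans as library functions is that a vendor can substitute a parallel algorithm with the same results.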

A few more library functions supporting advanced features (such as the new
DISTRIBUTE patterns) are included as approved extensions. It should be noted
that many of those operations are actually generic interfaces that accept
several numeric types as input; the actual runtime library is likely much 
larger. 
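
The effect of generic interfaces on library size can be seen in a small sketch. The module below is our own illustration in standard Fortran; SUM_SUFFIX_SKETCH and the two specific routines are hypothetical names, not the HPF library's. One generic name resolves to one specific routine per argument type, so every additional supported type adds another routine behind the same name.

```fortran
module suffix_generic
  implicit none
  ! One generic name, one specific routine per argument type.
  interface sum_suffix_sketch
    module procedure sum_suffix_int, sum_suffix_real
  end interface
contains
  ! Suffix sum: s(i) = x(i) + x(i+1) + ... + x(n).
  function sum_suffix_int(x) result(s)
    integer, intent(in) :: x(:)
    integer :: s(size(x)), i
    if (size(x) == 0) return
    s(size(x)) = x(size(x))
    do i = size(x) - 1, 1, -1
      s(i) = s(i+1) + x(i)
    end do
  end function sum_suffix_int

  function sum_suffix_real(x) result(s)
    real, intent(in) :: x(:)
    real :: s(size(x))
    integer :: i
    if (size(x) == 0) return
    s(size(x)) = x(size(x))
    do i = size(x) - 1, 1, -1
      s(i) = s(i+1) + x(i)
    end do
  end function sum_suffix_real
end module suffix_generic

program generic_demo
  use suffix_generic
  implicit none
  print '(4i6)',   sum_suffix_sketch((/ 1, 2, 3, 4 /))          ! 10 9 7 4
  print '(4f6.1)', sum_suffix_sketch((/ 0.5, 0.5, 1.0, 2.0 /))  ! 4.0 3.5 3.0 2.0
end program generic_demo
```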

4.2 Advanced Topics 

The features in Section 4.1 provide convenient abstractions for data
parallelism. Some applications, however, require more detailed performance
tuning. HPF 2.0's approved extensions provide two directives to support this.
The ON directive identifies the processor to execute a statement or block of
statements. This can be used to override the owner-computes rule (or other
compilation mechanism) in cases where it is inappropriate. The RESIDENT
directive gives the compiler more information about data locality. In particular,
it asserts that, given previous data mapping and ON directives, certain data
references do not require communication. Both directives give the compiler 
information that it would not otherwise have. The compiler may use this ad- 
ditional information in any way that it deems appropriate; presumably, this 
will improve parallelism and reduce overhead on parallel architectures, but 
there may be other applications on sequential processors.

4.2.1 The ON Directive. Although the HPF language definition does not
specify how data-parallel operations are implemented, in practice most
compilers attempt to statically schedule the computation to occur where the
mapping directives store the data. This "owner-computes rule" is a good heuristic
in general, but like any heuristic it does not work in all cases. Even when
it works, there are often several ways to interpret the heuristic which may
give different results. Finally, many users have done careful analysis to
determine optimal combinations of computation scheduling and data mapping.
For all these reasons, we need the option of fine-level control of scheduling
computation to processors. The HPF 2.0 approved extensions include the ON
directive for this purpose. It gives a high-level suggestion of where to perform
a computation, much as the DISTRIBUTE directive suggests where the data 
should be mapped. 

The ON directive has a single-statement and a multi-statement form. The 
single-statement form has two variants: 

!HPF$ ON HOME( variable )

or

!HPF$ ON ( processor-section )

Variable is the Fortran syntax for any reference to a named object, including
simple references like X and more complex ones like Y(1:N)%FIELD1. A
processor-section is an element or section of an HPF processors arrangement.
We refer to either the variable or processor-section as the HOME clause. The
two forms for the multi-statement form are similar:

!HPF$ ON HOME( variable ) BEGIN
   block
!HPF$ END [ ON ]

or

!HPF$ ON ( processor-section ) BEGIN
   block
!HPF$ END [ ON ]

The single-statement form controls the execution of the statement
immediately following it; the multi-statement form controls all statements in its
block. In either case, the HOME clause names the processor(s) to perform the
computation, either directly (the processor-section form) or by indicating
the processors where some data is mapped (the variable option). A programmer
can think of the statement as executed in three steps:

1. Gather any data not already on the processors indicated in the HOME 
clause. 

2. Compute results from the data.

3. Scatter the results back to any changed variables that are not stored on
the HOME processor.

This is somewhat oversimplified, since the statement may be a DO loop or
other complex construct that requires several rounds of communication. How- 
ever, even in complex cases the idea is that the HOME processors perform the 
computation, perhaps on data communicated from somewhere else. ON direc- 
tives can name more than one processor in the HOME clause; in these cases 
the named processors cooperate to execute the statement. 

In effect, the HPF program begins with all processors active, and each ON
directive masks out the processors that do not match its HOME clause. The
masked processors can skip ahead to other computations if they do not need
data produced in the ON block that excluded them. In particular, when the ON
block is an iteration of an INDEPENDENT loop, processors can immediately
proceed to the first iteration where they are included in a HOME clause. If the loop
is not INDEPENDENT, it must perform synchronization whenever two
interfering iterations are scheduled on different processors. Scheduling the iterations
of a loop in this way is probably the most common usage of the ON
directive. In this case, a good compiler will invert any functions used in the HOME
clause and propagate this information to the loop header. The programmer
can therefore balance the computational load by choosing a HOME clause that
evenly spreads the iterations to all processors.
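
The inversion mentioned above can be sketched for the two simplest cases. Assume a loop over I = 1..N carrying ON HOME(A(I)), with A distributed either CYCLIC or BLOCK over NP processors; the names A, N, and NP and the sizes used are our assumptions. The program simulates each processor in turn with the loop bounds a compiler could derive, and checks that every iteration is claimed exactly once.

```fortran
program home_inversion
  implicit none
  integer, parameter :: n = 10, np = 4
  integer :: count_cyclic(n), count_block(n), p, i, bs

  count_cyclic = 0
  count_block = 0
  bs = (n + np - 1) / np          ! BLOCK size: N/NP rounded up

  do p = 1, np                    ! each pass plays the role of processor p
    ! CYCLIC: processor p owns elements p, p+NP, p+2*NP, ...
    do i = p, n, np
      count_cyclic(i) = count_cyclic(i) + 1
    end do
    ! BLOCK: processor p owns one contiguous range of at most BS elements.
    do i = (p-1)*bs + 1, min(p*bs, n)
      count_block(i) = count_block(i) + 1
    end do
  end do

  ! Every iteration must be executed by exactly one processor.
  if (any(count_cyclic /= 1) .or. any(count_block /= 1)) stop 1
  print *, 'each iteration is assigned to exactly one processor'
end program home_inversion
```

With these bounds no processor has to test ownership inside the loop body; it simply never enumerates iterations that belong elsewhere.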

Nested ON blocks are allowed, so long as the inner HOME clause indicates a 
subset of the processors in the outer clause. Again, the behavior is essentially 
that each level of ON directive masks out some processors. In the absence
of data dependence constraints, the masked processors can proceed to other
computations. This provides a form of nested parallelism. As we will see in
Section 5.2, it can also be used for pipelined parallelism. 

Figure 4.4 illustrates this process. (The RESIDENT directives will be
explained in Section 4.2.2.) These are typical loop structures for
constructing unstructured finite element matrices or performing relaxations
on unstructured grids. The ON directive in the DO I loop tells the compiler
to statically schedule each iteration on the processor that owns the
corresponding element of IX1. Processor P(1) will execute iterations 1, 5, 9,
and so forth; processor P(4) will execute iterations that are divisible by 4.
No matter what N is, the iterations will be approximately evenly balanced
among processors. (If IX1 were distributed by BLOCK, the computation would
not be balanced for N<1000.) This form of the HOME clause is appropriate for
many loops over the elements of an array. The ON clause in the DO J loop
tells the compiler to execute iteration J on the processor storing element
Y(IX2(J)). Assuming a parallel implementation, load balance depends on the
values in IX2. If they are evenly spread over the range from 1 to 500, then
the load will be balanced; if they are concentrated in one processor's block
(for example, if half of them fall in the range 1 to 100), then that
processor will become a bottleneck. The advantage of such complex ON clauses
typically comes from controlling data movement, as we will see in
Section 4.2.2; this must be tempered by the cost of setting up the loop. In
the DO J loop of the example, the runtime system will have to gather array
IX2, examine the elements, and create a list for each processor of the
iterations it is responsible for. This examination can be performed in
parallel, with the results saved and reused if the loop is executed again
with the same IX2 array; this strategy is called the inspector-executor
paradigm [13,20], and is used in production compilers. Still, this is a
significant overhead, both in compilation technology and in runtime
performance.

      REAL X(500), Y(500), Z(1000)
      INTEGER IX1(1000), IX2(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE X(BLOCK) ONTO P
!HPF$ DISTRIBUTE IX1(CYCLIC) ONTO P
!HPF$ ALIGN IX2(I) WITH IX1(I)
!HPF$ ALIGN Y(J) WITH X(J)
!HPF$ ALIGN Z(K) WITH IX1(K)

!HPF$ INDEPENDENT, LOCAL(TMP), REDUCTION(X)
      DO I = 1, N
!HPF$   ON HOME( IX1(I) ) BEGIN
!HPF$   RESIDENT ( Y(IX1(I)) )
          TMP = Z(I) * (Y(IX1(I))-Y(IX2(I)))
          X(IX1(I)) = X(IX1(I)) + TMP
          X(IX2(I)) = X(IX2(I)) - TMP
!HPF$   END ON
      END DO

!HPF$ INDEPENDENT, LOCAL(TMP), REDUCTION(X)
      DO J = 1, M
!HPF$   ON HOME( Y(IX2(J)) ) BEGIN
!HPF$   RESIDENT ( Y(IX1(J)) )
          TMP = Z(J) * (Y(IX1(J))-Y(IX2(J)))
          X(IX1(J)) = X(IX1(J)) + TMP
          X(IX2(J)) = X(IX2(J)) - TMP
!HPF$   END ON
      END DO

Fig. 4.4. Example of ON and RESIDENT directives
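
The inspector phase for the DO J loop can be sketched in a few lines of
Python (used here purely for illustration; it is not HPF). The sketch assumes
the mapping of Figure 4.4, where Y(500) is distributed BLOCK over four
processors so that each block holds 125 elements. The names block_owner and
inspect are hypothetical, and a real runtime system would carry out the
examination in parallel rather than in one sequential pass.

```python
from collections import defaultdict

BLOCK_SIZE = 125   # assumption: Y(500) distributed BLOCK over 4 processors


def block_owner(index):
    """Return the 1-based processor number owning 1-based element `index`."""
    return (index - 1) // BLOCK_SIZE + 1


def inspect(ix2):
    """Inspector: for each processor, list the iterations J it must execute
    under ON HOME( Y(IX2(J)) ). ix2[0] holds IX2(1)."""
    schedule = defaultdict(list)
    for j, target in enumerate(ix2, start=1):
        schedule[block_owner(target)].append(j)
    return dict(schedule)


# Subscripts spread evenly over 1..500 give a balanced schedule ...
even = inspect([1, 126, 251, 376, 2, 127, 252, 377])
# ... while subscripts concentrated in one block create a bottleneck.
skewed = inspect([1, 2, 3, 4, 5, 6, 251, 376])

print({p: len(js) for p, js in sorted(even.items())})
print({p: len(js) for p, js in sorted(skewed.items())})
```

An executor would then run each processor's list of iterations, and could
reuse the saved schedule for as long as IX2 is unchanged.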

Note that if both the I and J loops are executed with the same data, the
final values in the X array will be the same. This again illustrates that ON is
a directive, and as such does not change the semantics of the code.

4.2.2 The RESIDENT Directive. In addition to load balancing, the ON
directive can affect the data movement required by a program. Conceptually,
if a computation on processor P requires data on processor Q, then the data
must be communicated. The underlying hardware may have different ways
of expressing this, from explicit messages to simple shared memory loads,
but it is communication at some level. Often the programmer knows that
communication is not needed; if this information can be given to the
compiler, significant optimizations are possible. This is the task of the
RESIDENT directive, one of the HPF 2.0 approved extensions.

Like the ON directive, RESIDENT has a single-statement and a
multi-statement form. The syntax for the single-statement form is

!HPF$ RESIDENT [ ( variable-list ) ] 

The multi-statement form uses BEGIN and END as the multi-statement ON 
does. RESIDENT can also be added as a clause of the ON directive. 

Like the INDEPENDENT directive, RESIDENT is an assertion that, if true,
will not change the results computed by the program. In this case, the
assertion is that the variable references in the list reside on the currently
active processor(s). Note that in order to make this assertion, one needs to
know two things:

1. The location of the computation, which is set by the ON directive. 

2. The location of the data, which is set by the ALIGN and DISTRIBUTE 
directives. 

Therefore, RESIDENT can only be asserted when both data mapping and ON
directives have been used. For variables that are assigned to, all copies of
the variable must be on active processors; for variables that are only
referenced, only some copies need be stored there. In either case, this
indicates to the compiler that no communication needs to be generated for the
variable. Finally, if the variable list is omitted, then RESIDENT is asserted
for all data.

To illustrate the kinds of assertions possible with RESIDENT, we turn again
to Figure 4.4. In the DO I loop there, we have guaranteed that
the computational load is evenly balanced. This does not, however, imply
that there is no communication; in fact, the compiler will have to make some
worst-case assumptions in this area. With no RESIDENT clause, the compiler
can only assume that references to IX1(I), IX2(I), Z(I), and TMP are
resident (because the first three are aligned together, and the last is a
LOCAL variable). This leaves the compiler to generate communication for all
references to X and Y. Although the volume of data depends on the contents
of IX1 and IX2, one would generally expect a gather of Y before the loop
and a combining scatter of X after the loop. With the RESIDENT directive as
given, the compiler can additionally assume that Y(IX1(I)) and X(IX1(I))
are resident (because the first is declared so, and the second has an identical
access pattern and data distribution). The system can therefore avoid
generating communication for these references. While a gather and scatter are
still required, they will have a simpler setup phase. In the DO J loop, without
the RESIDENT directive the compiler could similarly detect that Y(IX2(J)),
X(IX2(J)), and TMP are resident, since the HOME clause names Y(IX2(J));
communications would be generated for Y(IX1(J)), X(IX1(J)), IX1(J), IX2(J),
and Z(J). (Note that a reference being resident does not imply that all its
subscripts are as well; for example, Y(IX2(J)) is resident but IX2(J) is
not.) With the RESIDENT directive, the compiler can add Y(IX1(J)) and
X(IX1(J)) to the list of resident references, and only generate
communication for arrays IX1, IX2, and Z. In this case, not only is the setup
simpler, but we completely avoid communication for two arrays. A programmer
could avoid all communication in the J loop by using the simple directive

!HPF$ RESIDENT 
in place of the line 

!HPF$ RESIDENT ( Y(IX1(J)) ) 

This would imply that, in all cases,

\[
  \left\lfloor \frac{\mathrm{IX1}(J)-1}{125} \right\rfloor
  = \left\lfloor \frac{\mathrm{IX2}(J)-1}{125} \right\rfloor
  = (J-1) \bmod 4 .
\]

The terms represent the processor homes of X(IX1(J)), X(IX2(J)), and
IX1(J), respectively; due to the ALIGN directives, they also cover all other
references in the loop.
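
As a rough illustration, the condition implied by the blanket RESIDENT
directive can be checked with the following Python sketch. It assumes the
mapping of Figure 4.4 (X and Y in blocks of 125 elements, IX1, IX2 and Z
distributed cyclically over four processors) and numbers processors from 0,
as the formula does. The function names are hypothetical; an HPF compiler
does not perform such a check, since RESIDENT is an assertion made by the
programmer.

```python
def home_block(index, block_size=125):
    """0-based processor holding 1-based element `index` of X or Y (BLOCK)."""
    return (index - 1) // block_size


def home_cyclic(j, nprocs=4):
    """0-based processor holding IX1(J), IX2(J) and Z(J) (CYCLIC)."""
    return (j - 1) % nprocs


def blanket_resident_holds(ix1, ix2):
    """True if, for every iteration J, all references live on one processor."""
    return all(
        home_block(a) == home_block(b) == home_cyclic(j)
        for j, (a, b) in enumerate(zip(ix1, ix2), start=1)
    )


# Iterations 1..5 touch only elements owned by processors 0, 1, 2, 3, 0.
print(blanket_resident_holds([1, 130, 260, 400, 100], [50, 200, 300, 499, 2]))
# Iteration 1 touches X(1) on processor 0 and X(126) on processor 1.
print(blanket_resident_holds([1], [126]))
```

If the check fails for some J, the blanket directive would be a false
assertion for that data set, and only the narrower RESIDENT ( Y(IX1(J)) )
form could be used safely.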



5. Task Parallelism 

Data parallelism is an important abstraction for many scientific applications,
in part because vector notation and linear algebra are very common in
mathematics and science. However, not every problem benefits from data
parallelism. For example, the popular 'task pool' style of parallelism, in
which worker processes dynamically extract tasks from a central server, does
not map well onto array syntax. Another example that we have already seen is
nondeterministic algorithms such as chaotic relaxation or parallel alpha-beta
searching. Finally, sometimes low-level control of communications
mechanisms or other machine-specific operations is needed for efficiency. For
all these reasons, HPF 2.0 provides interfaces to parallel programming
models other than data parallelism. EXTRINSIC procedures are a feature of core
HPF 2.0 that act as an 'escape hatch' into task-parallel programming
environments and possibly other languages. The TASK_REGION construct is an
HPF 2.0 approved extension that integrates task and data parallelism
without leaving the HPF language.



5.1 EXTRINSIC Procedures 

The purpose of EXTRINSIC interfaces is to allow HPF programmers to take
advantage of other programming paradigms when it is appropriate. This
ability is often used to provide access to libraries written elsewhere, tune
performance-critical parts of the program, and express non-data-parallel
algorithms (including nondeterministic algorithms) on data mapped by HPF
directives. The basic concept is that the EXTRINSIC procedure itself is
written in another language, such as Fortran with message-passing extensions.
HPF defines the interaction with this language at the interface, allowing the
HPF compiler to be sure of its assumptions and giving the other language
the information that it needs about data layout. Because the needs of other
languages are diverse, EXTRINSIC is actually a family of interfaces, essentially
one per language.

Syntactically, the EXTRINSIC interface is a new prefix to the FUNCTION or
SUBROUTINE header. The syntax of the new prefix clause is

EXTRINSIC ( extrinsic-kind-keyword ) 

OR 

EXTRINSIC ( extrinsic-spec-arg-list ) 

HPF 2.0 defines three extrinsic-kind-keywords: HPF, HPF_LOCAL, and
HPF_SERIAL; several more are defined as approved extensions, including
FORTRAN 77 and C language bindings. We will describe HPF_LOCAL in more
detail below; its definition is typical of other EXTRINSIC types.
EXTRINSIC(HPF) is simply HPF; it was defined to reserve the name if EXTRINSIC
was adopted by other Fortran dialects. HPF_SERIAL procedures execute on one
and only one processor; they are designed for interfacing to graphics and
other libraries that cannot be called in parallel. The extrinsic-spec-arg
form allows modular definition of new extrinsic types as a combination of
LANGUAGE=, MODEL=, and EXTERNAL_NAME= specifications. For example, HPF_LOCAL
is formally defined as (LANGUAGE='HPF', MODEL='LOCAL').

Semantically, EXTRINSIC(HPF_LOCAL) is a contract between the main
HPF program and the HPF_LOCAL function. The EXTRINSIC(HPF_LOCAL)
contract states that the HPF main program will

1. Have an explicit interface with the EXTRINSIC directive

2. Remap data (if needed) to meet distribution specifications in that
interface

3. Synchronize all active processors before the call

4. Call the local routine on every active processor

The first three guarantees are common to all forms of EXTRINSIC; they
assure that the data and computation are in a consistent state before the
procedure begins execution. The final guarantee is the key part of any
EXTRINSIC(MODEL='LOCAL', ...) interface; it starts the procedure on every
processor in a single-program multiple-data mode. Before returning, however,
the EXTRINSIC(HPF_LOCAL) contract requires that the procedure will

1. Obey INTENT(IN) and INTENT(OUT) declarations

2. Ensure that if variables are replicated, they will again be consistent before
return

3. Only access data that could be accessed by an HPF program with the
same interface

4. Synchronize the active processors before returning control to HPF

5. Declare all dummy arguments to be of assumed shape

The first four constraints are common to all EXTRINSIC models; they ensure
that data is consistent when control returns to HPF. In particular, the
HPF_LOCAL procedure cannot leave processes running that may corrupt HPF data
asynchronously with the main program. The constraint on using assumed-shape
arguments is peculiar to EXTRINSIC(LANGUAGE='HPF', ...) procedures; it
supports HPF inquiry functions in the HPF_LOCAL procedure.

The HPF_LOCAL procedure itself is a task-parallel program. Every
processor executes its own copy of the procedure, and can have processor-local
state such as the processor id. Note that global HPF does not have
variables local to a processor. The closest concept is the LOCAL clause of the
INDEPENDENT directive, which cannot save data after the iteration is
complete. The reason for this is simple: processor-local data can destroy the
single-threaded execution model of HPF. Within the HPF_LOCAL model,
however, this is not a problem. Although the processors are initially executing
as SPMD processes, the procedure can change to loosely synchronous or fully
asynchronous mode, in any way that the hardware and HPF_LOCAL compiler
permit. Mapped arrays can be passed as arguments, but within the
HPF_LOCAL procedure they appear as ordinary arrays consisting of only the
data mapped to the local processor. In other words, a reference to A(1)
inside an HPF_LOCAL routine refers to the first element stored on the local
processor, while in a global HPF program A(1) refers to the first element of
the global array, no matter which processor stores it. The HPF_LOCAL_LIBRARY
module provides utility functions to translate between these views of the
data. The HPF_LOCAL model itself does not provide synchronization or
communication primitives, but it is compatible with such standard interfaces
as the Message Passing Interface (MPI).
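
The difference between the global and the local view of a mapped array can
be made concrete with a small Python sketch. It assumes a one-dimensional
array distributed BLOCK with block size ceil(n/nprocs); the helper names
global_to_local and local_to_global are hypothetical and are not the routines
of the HPF_LOCAL_LIBRARY module.

```python
import math


def block_size(n, nprocs):
    """Block size of an n-element array distributed BLOCK over nprocs."""
    return math.ceil(n / nprocs)


def global_to_local(g, n, nprocs):
    """Map 1-based global index g to (0-based processor, 1-based local index)."""
    b = block_size(n, nprocs)
    return (g - 1) // b, (g - 1) % b + 1


def local_to_global(proc, local, n, nprocs):
    """Inverse mapping: local index on processor `proc` to the global index."""
    return proc * block_size(n, nprocs) + local


# With 100 elements on 4 processors, local A(1) on processor 1 is global A(26).
print(global_to_local(26, 100, 4))
print(local_to_global(1, 1, 100, 4))
```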

Figure 5.1 demonstrates some features of EXTRINSIC(HPF_LOCAL). The
top section of the figure is a standard HPF program that descriptively passes
two mapped arrays to subroutine F. The bottom shows an HPF_LOCAL
subroutine which performs an iterative operation on each processor. Since the
subroutine has no internal synchronization, each processor runs completely
independently of the others; different processors may well perform different
numbers of DO WHILE iterations. Also, the CSHIFT and MAXVAL intrinsics in
F only operate on local data, rather than communicating data from other
processors. HPF_LOCAL subroutines are excellent vehicles for performing
such per-processor operations.
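
The per-processor behavior can be imitated by the following Python sketch,
which is a simplified one-dimensional analogue and not a translation of the
HPF_LOCAL subroutine in Figure 5.1. Each call to the hypothetical function
local_relax stands for one processor iterating on its own block, with a
circular shift that wraps around within the local data only.

```python
def cshift(v, shift):
    """Circular shift of a list, in the spirit of Fortran's CSHIFT:
    element i of the result is element i+shift of v (wrapping around)."""
    k = shift % len(v)
    return v[k:] + v[:k]


def local_relax(x, tol=1e-6):
    """Relax one processor's local block until it stops changing.
    Returns the final block and the number of iterations performed."""
    iterations = 0
    err = float("inf")          # forces at least one iteration
    while err > tol:
        y = [(a + b) / 2 for a, b in zip(cshift(x, 1), x)]
        err = max(abs(a - b) for a, b in zip(x, y))
        x = y
        iterations += 1
    return x, iterations


# Two "processors" with different local data need different iteration counts.
print(local_relax([1.0, 1.0, 1.0, 1.0])[1])
print(local_relax([0.0, 4.0, 0.0, 4.0])[1])
```

Because no data crosses block boundaries, nothing forces the two calls to
finish after the same number of iterations.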







      REAL a(0:n,0:n), b(0:n,0:n)
!HPF$ DISTRIBUTE a(BLOCK,*)
!HPF$ ALIGN b(:,:) WITH a(:,:)
      INTERFACE
        EXTRINSIC (HPF_LOCAL) SUBROUTINE F( X, Y )
          REAL, INTENT(INOUT) :: X(:,:)
          REAL, INTENT(OUT) :: Y(:,:)
!HPF$     DISTRIBUTE X *(BLOCK,*)
!HPF$     ALIGN Y(:,:) WITH *X(:,:)
        END SUBROUTINE F
      END INTERFACE
      CALL F( A, B )


      EXTRINSIC (HPF_LOCAL) SUBROUTINE F( X, Y )
        REAL, INTENT(INOUT) :: X(:,:)   ! INOUT because X is updated below
        REAL, INTENT(OUT) :: Y(:,:)
        REAL ERR; INTEGER M1, M2
        M1=SIZE(X,1); M2=SIZE(X,2)
        ERR = 1.0                       ! ensure the loop is entered
        DO WHILE (ERR > 1E-6)
          Y = (CSHIFT(X,1,1)+CSHIFT(X,1,2)+X) / 3
          ERR = MAXVAL(ABS(X-Y))
          X = Y
        END DO
      END

Fig. 5.1. Example of HPF_LOCAL



5.2 The TASK_REGION Directive 

The EXTRINSIC mechanism is very good for interfacing to other programming
paradigms. However, to use it a programmer must leave the HPF
environment, if only temporarily. Some users requested extensions to HPF that
would allow task-parallel programming within the language itself. After
exploring several possibilities, including HPF bindings for MPI and PCF, the
HPFF designed the TASK_REGION directive to provide abstract, large-grain
tasks where each task might itself be data-parallel. This model is very
similar to the FX language developed at Carnegie Mellon University [21].
TASK_REGION is an approved extension of HPF 2.0.

The syntax of a TASK_REGION is

!HPF$ TASK_REGION

block

!HPF$ END TASK_REGION

The code block itself consists of ON blocks and statements not guarded by an
ON directive. As we will see below, each ON block acts as a task; statements not
in any ON block act as synchronization. Any outer-level ON directive must have
the RESIDENT option without a variable list. This ensures that processors in
the task will need only synchronize and communicate with other processors
at the beginning and end of the ON block.

As a directive, TASK_REGION does not change the computed results of
the program, but may affect execution efficiency. Formally, the directive is
an assertion that ON blocks whose active processor sets do not overlap do
not interfere with each other. The same definition of interference is used
as for INDEPENDENT, including the restrictions on input/output. Because of
these restrictions, the TASK_REGION can be executed by synchronizing only
the processors that will take part in an ON block at the block's beginning and
end. Processors not selected by the HOME clause can continue on to the next
statement. Statements outside of any ON block require synchronization of all
processors that store mapped data used by the statement; scalar variables can
be replicated on all processors and updated in replicated fashion. In effect,
each processor executes a reduced TASK_REGION that leaves out ON blocks
where it does not participate. An HPF program without the TASK_REGION
directive could not eliminate the 'extra' ON blocks because they might require
synchronization.

This is perhaps easier to understand as an example. Consider Figure 5.2.
This program creates a two-task pipeline for performing a series of 2-D FFTs.
The first task reads data into array A1 and performs an FFT on each row.
The data is then copied to array A2. Task 2 performs the column FFTs and
outputs the results to a separate file. Because the two arrays are mapped
to different processor sets, each execution of task 2 can be overlapped with
the following execution of task 1. The array copy between the two is the
sole required synchronization; updating of the DO index can be performed
redundantly on all processors. This organization of the program has several
advantages over other possibilities:

Both arrays could be distributed on all processors. However, this decreases
the granularity of computation on each processor; depending on the
synchronization costs of a particular machine, this may lead to a significant
increase in parallel overhead.

One mapped array could be used. This would require either remapping
by one of the CALL statements, or porting one of the routines to reflect
a different distribution; either way, it would be impossible to overlap the
input and output, since the same array would be needed for both.

For a fuller discussion of the merits of the task-parallel approach, one can
examine the work on FX [21].

!HPF$ PROCESSORS P(8)
!HPF$ DISTRIBUTE A1(BLOCK,*) ONTO P(1:4)
!HPF$ DISTRIBUTE A2(*,BLOCK) ONTO P(5:8)
!HPF$ DISTRIBUTE TD1(BLOCK) ONTO P(1:4)

!HPF$ TASK_REGION
      DO K = 1, N
        ! Task 1
!HPF$   ON HOME(A1) BEGIN, RESIDENT
          READ (IUNIT1) A1
          CALL ROWFFTS(A1)
!HPF$   END ON
        A2 = A1
        ! Task 2
!HPF$   ON HOME(A2) BEGIN, RESIDENT
          CALL COLFFTS(A2)
          WRITE (IUNIT2) A2
!HPF$   END ON
      ENDDO
!HPF$ END TASK_REGION

Fig. 5.2. Example of the TASK_REGION construct
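
The overlap that Figure 5.2 permits can be mimicked, very loosely, with two
Python threads that stand in for the two processor subsets. This is only an
analogy under stated assumptions: the one-slot queue plays the role of the
assignment A2 = A1, the strings stand in for the FFT routines and the I/O
statements, and the names run_pipeline, task1 and task2 are hypothetical.

```python
import queue
import threading


def run_pipeline(n):
    """Run n pipeline steps; task 2 of step k may overlap task 1 of step k+1."""
    handoff = queue.Queue(maxsize=1)   # plays the role of A2 = A1
    log = []

    def task1():
        for k in range(1, n + 1):
            a1 = f"rows({k})"          # stands in for READ and ROWFFTS
            handoff.put(a1)            # the only synchronization point
        handoff.put(None)              # end-of-stream marker

    def task2():
        while (a2 := handoff.get()) is not None:
            log.append(f"cols({a2})")  # stands in for COLFFTS and WRITE

    workers = [threading.Thread(target=task1), threading.Thread(target=task2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log


print(run_pipeline(3))
```

As in the HPF version, the two tasks interact only at the hand-off, so the
results come out in the same order as in a sequential execution.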



6. Input and Output 

One of the most difficult aspects of parallel computation is input and output
of the data. Because the high speeds and large memory capacities of parallel
machines allow huge problems to be solved, it is not surprising that input and
output requirements of parallel programs are also high. Addressing these
requirements must be done at all levels of the machine, from hardware through
systems to the applications running there; since HPF is one layer of the
system, it needs some I/O capability. In view of this, the HPFF tried
in both the discussions leading to HPF 1 and HPF 2.0 to define constructs for
parallel I/O. Unfortunately, parallel I/O architectures are still very
dissimilar from machine to machine, and widely-accepted unifying abstractions
at a level comparable to HPF's data parallelism have yet to emerge. It is
therefore unsurprising that HPF's support for parallel input and output is
not as extensive as other parts of the language. However, HPF 2.0's approved
extensions do include support for asynchronous I/O, which is a significant
step forward. It should be noted that the Fortran 2000 committee is also
working on a proposal for asynchronous I/O; theirs is somewhat more general,
but compatible in concept with HPF. It should also be noted that Fortran has
an extensive set of I/O operations, and that HPF inherits all of these. In
particular, it is legal in Fortran to read or write an entire array to a file in
one statement. In an HPF context, that array could well be distributed; this
provides many opportunities for high-quality compilers and runtime systems
to exploit any underlying parallel I/O capabilities.

Syntactically, asynchronous I/O uses new options in the OPEN statement's
connection specification, new specifications in the I/O control lists, and a
new WAIT statement. In the OPEN statement, a file which will be read or
written asynchronously must have the ASYNCHRONOUS connect specification after
the usual UNIT= and FILE= specifications. Asynchronous I/O statements are
identified by the presence of the new ASYNCHRONOUS and ID= control
specifications. A READ or WRITE statement with these options begins an
asynchronous operation and sets the ID= variable to a unique value. This value
will be used later to identify the asynchronous operation in progress. Multiple
outstanding asynchronous I/O operations per file are allowed if they do not
reference the same record of the file. The program may not reference any
variable in the I/O list of an asynchronous I/O operation until the operation
has been completed by a WAIT statement. The syntax for the WAIT statement is

WAIT ( UNIT= io-unit , ID= int-expr [, ERR= label] [, IOSTAT=
int-variable] )

The keyword arguments may appear in any order. When executed, the WAIT
statement halts program execution until the asynchronous operation
identified by the ID expression completes. The ERR= and

Figure 6.1 shows how these operations can be used to create a two-stage
pipeline. The pipeline is started by opening THEINPUT for asynchronous
reading and beginning the transfer of the first block of data into A. From
then on, execution alternates between overlapping the reading of B and the
processing of A in the first part of the loop and overlapping the reading of
A and the processing of B in the second part of the loop. If the time for
subroutine PROCESSING is roughly the same as for reading a block of data, the
computer is always both computing and performing I/O. If the processing time
is much less than the I/O time, additional buffers could be added to achieve
more overlap. PROCESSING cannot access either A or B as global data, or it
will violate the constraint against referencing data involved in an outstanding
asynchronous operation.

      OPEN( IUNIT, FILE='THEINPUT', ASYNCHRONOUS )
      READ( IUNIT, ID=ID0, END=100 ) A
      DO
        WAIT( ID=ID0 )                    ! Wait for A
        READ( IUNIT, ID=ID1, END=100 ) B  ! Start B
        CALL PROCESSING( A )              ! Overlap I/O with compute
        WAIT( ID=ID1 )                    ! Wait for B
        READ( IUNIT, ID=ID0, END=100 ) A  ! Start A
        CALL PROCESSING( B )              ! Overlap I/O with compute
      END DO
  100 CONTINUE

Fig. 6.1. Example of asynchronous I/O
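
The same double-buffering idea can be sketched in Python with a background
worker, as a rough analogy to the READ/WAIT pairs of Figure 6.1. The sketch
assumes that blocks arrive from an in-memory iterable rather than a file, and
the names read_block and process_stream are hypothetical. Here pool.submit
corresponds to starting an asynchronous READ and result() to the WAIT
statement.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(source):
    """Stand-in for an asynchronous READ: next block, or None at end of data."""
    return next(source, None)


def process_stream(blocks, processing):
    """Process every block while the next one is being fetched."""
    source = iter(blocks)
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_block, source)       # start the first READ
        while True:
            current = pending.result()                  # WAIT for the block
            if current is None:                         # END= condition
                break
            pending = pool.submit(read_block, source)   # start the next READ
            results.append(processing(current))         # overlaps with it
    return results


print(process_stream([[1, 2], [3, 4], [5]], sum))
```

As in the Fortran version, the processing step only touches the block that
has already been waited for, never the one still in flight.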



7. Summary and Future Outlook 

As HPF 2.0 makes its appearance, it is appropriate to examine HPF's
successes, shortcomings, and plans for the future.

HPF compilers are now available for all parallel machines of wide
interest. Some hardware vendors, including IBM and Digital, have
developed their own compilers. Other software vendors, including Portland Group
and Applied Parallel Research, sell systems that produce portable code by
outputting message-passing Fortran. User experience with the language is
growing. A number of interesting HPF applications were reported at the
first HPF Users Group in February 1997 (the proceedings are available
on the World-Wide Web at http://www.lanl.gov/HPF/index.html and
http://www.crpc.rice.edu/HPFF/home.html). It is fair to say that HPF
is now recognized as an efficient language for a number of problems,
particularly those defined on regular grids.

However, HPF is not without its problems. The early implementations
were often inefficient or did not work at all; many users of those systems
developed a poor opinion of the language which will be difficult to reverse.
The language itself provides relatively limited features for some important
operations, such as irregular meshes, parallel I/O and task parallelism.
Although HPF 2.0 improves the support in all these areas, it will be some time
before implementations will appear and even longer before we will know if
the new support is sufficient.

Meanwhile, MPI has many staunch backers as an alternate programming
paradigm for parallel machines. We feel that HPF and MPI complement
each other rather than compete; each system has advantages and drawbacks
for particular problems. Recognizing this, the HPF Forum designed the
language to permit convenient interoperability with message passing
environments through the EXTRINSIC interface mechanism.

Finally, user feedback at the HPF Users Group meeting and other forums
has pointed out two major limitations in current HPF support: efficient
parallel libraries and usable tools for debugging and performance analysis. Our
own research at Rice is directed at these areas, as are several other
commercial and academic projects worldwide.

There are no current plans for new additions to HPF, although mailing
lists exist to resolve technical questions and disseminate HPF news. We
expect the HPF Users Group meetings to be held annually as long as there is
interest in the language. Judging from the first meeting, these will inspire
a good deal of discussion which will lead to improvements in the language
implementations, better use of the language, and (probably) more language
extensions. We look forward to exciting times as the compilers improve and
as programmers become more adept at exploiting them.

Summary. Programming a massively parallel machine is a daunting task for any
human programmer, and parallelization may even be impossible for any compiler.
Instead, the functional programming paradigm may prove to be an ideal solution by
providing an implicitly parallel interface to the programmer. We describe here the
Sisal project (Stream and Iteration in a Single Assignment Language) and its goal
to provide a general-purpose user interface for a wide range of parallel processing
platforms.



1. Introduction 

The history of computing has shown shifts from explicit to implicit
programming. In the early days, computers were programmed in assembly language,
mostly with the purpose of utilizing the available memory space as effectively
as possible. This came at the cost of obscure, machine-dependent, hard to
maintain programs, which were designed with high programming effort.
Fortran was introduced to make programming more implicit, portable and less
machine-dependent. With the advent of massively parallel computers and
their promise of hundreds of gigaflops, we have seen a return to the explicit
programming paradigm. Using, for example, C with explicit message passing
library routines as machine language, people attempt to utilize the available
processing power to the largest extent, again at the cost of high programming
effort and machine-dependent, hard to maintain code. A compiler for an
implicitly parallel programming language relieves the programmer of the
task of partitioning program and data over the massively parallel machine.

It is our view that explicit parallel programming is a transition stage in
the evolution of parallel computing and that implicit parallel programming
languages will eventually become the norm as did high-level languages in
the sequential paradigm. This will result in a tremendous improvement in
programming quality in terms of programming effort, readability, portability,
extendability and maintainability of parallel code. Another consequence will
be the accessibility of parallel programming to a wider public that would
make use of a wide spectrum of parallel computers, from a few processors on
a chip to several-thousand-processor machines.

Functional programming [6] is an alternate programming paradigm which
is entirely different from the conventional model: a functional program can be
recursively defined as a composition of functions where each function can itself
be another composition of functions or a primitive operator (such as
arithmetic operators, etc.). The programmer need not be concerned with explicit
specification of parallel processes since independent functions are activated
by the predecessor functions and the data dependencies of the program. This
also means that control can be distributed. Further, no central memory
system is inherent to the model since data is not written in by any instruction
but is passed from one function to the next.

Sisal (Stream and Iteration in a Single Assignment Language) [22] is such 
a functional language which was originally designed by collaborating teams 
from the Lawrence Livermore National Laboratory, Colorado State Univer- 
sity, the University of Manchester and Digital Equipment Corporation. The 
goal of the project was to design a general-purpose implicitly parallel lan- 
guage for a wide range of parallel platforms. 

The goal of this paper is to describe the last phases of this project as we 
are currently undertaking them. In section 2, a short tutorial will present the 
basic principles of Sisal. An early compiler implementation for shared memory 
systems is described in section 3. Sisal 90 and its foreign language interface 
is introduced in section 4. We turn our attention to Distributed Memory 
implementations in section 5, while section 6 introduces implementation of 
multithreading principles. Section 7 concludes. 



2. The Sisal Language: A Short Tutorial 

Sisal is a functional language that offers automatic exploitation and
management of parallelism as a result of its functional semantics. In Sisal,
and all functional languages, user-defined names are identifiers rather than
variables, and they refer to values rather than memory locations. The values
produced and used in a Sisal program are all dynamic entities, and their
identifiers are defined, or bound to them, only for the duration of their
existence in an execution. This is the dynamic of the dataflow graph, in which
graph nodes are operations, and values are carried on the arcs connecting the
nodes. The extent of the existence of a value is the set of arcs on which it
travels between the point of its definition and the point of its final
consumption. The values that are defined by the graph arcs may or may not have
names assigned to them within a program.

All Sisal expressions and higher-level syntactic elements evaluate to and
return values based solely on the values bound to their formal arguments
and constituent identifiers. This eliminates any possibility of side effects, and
allows much richer analyses of programs by the compiler than is typically the
case for imperative languages.

To best illustrate these points, consider the following brief code fragment.
It is written in Sisal 1.2, the language currently accepted by the Optimizing
Sisal Compiler. The Sisal language is undergoing expansion and refinement,
as discussed in other sections, but the syntax of version 1.2 will suffice for
this example.

type OneDim = array [ real ];
type TwoDim = array [ OneDim ];

function generate ( n : integer returns TwoDim, TwoDim )
  for i in 1, n cross j in 1, n
    t1 := real(i) * real(j);
    t2 := real(i) / real(j)
  returns array of t1
          array of t2
  end for
end function % generate

The first two statements define type names for arrays. Note that no sizes
are provided; all Sisal aggregate data instances are dynamically created,
resized, and de-allocated at runtime. Only the dimensionality and element types
are relevant to the type specifications. The header for function generate
shows that one integer argument, n, is expected, and two unnamed values
will be returned. The returned values are two-dimensional arrays of single
precision reals, but again, only typing and not sizing is specified. Names can
be bound to these returned values at the site of invocation of function generate
if the programmer wishes. An invocation of a function is semantically
equivalent to the reproduction of the function code at that site, with
appropriate argument substitution. This equivalence, called referential
transparency, is a fundamental property of functional languages, and is
responsible for the strengths of the Sisal language. This strength lies in a
simplified analysis process for the compiler. Functions can run in parallel if
no data dependency exists between the functions. Functions with equivalent
inputs will always return equivalent values.

All Sisal expressions, including whole functions and programs, evaluate
to value sets. In the above case, the function evaluates to two arrays, which
are the values of the expression contained in the function definition. The
for-expression shown is a loop construct, which is an indicator of potential
parallelism to the Sisal compiler. This loop has an index range defined as the
cross product of two simpler ranges. This means that the body of the loop
will be instantiated as many times as there are values in the index range, in
this case n*n, and each body instantiation will be independent, since no data
dependencies exist among them. The set of independent loop bodies can be
executed in parallel or not, based on the compiler's and the runtime system's
analyses of their costs, as well as on options specified by the programmer.

The appearance of the names t1 and t2 within the body of the loop
should not be considered a reuse of these names in the sense of the
reassignment of a variable in an imperative program. Instead, the names are
used to define the computation in the loop body, and in fact these names will
likely have no real existence within the executing program. The important
point here is that each instance of the loop body, containing specific values
for i and j, will independently compute specific instances of the values
defined as real(i)*real(j) and real(i)/real(j); then all these separate values
will be gathered together into a pair of arrays and returned. The positions of
the values in the result arrays are determined by the loop's index ranges, as
are the overall size and dimensionality of the returned arrays. In this case,
two two-dimensional arrays are returned, with index ranges from 1 to n in each
dimension. The use of loop-temporary names is optional, and the return-clause
above could be rewritten as:

  returns array of real(i) * real(j)
          array of real(i) / real(j)

with no change in the ultimate results. The loop body, then, would appear 
to be empty, but in fact, the language treats the expressions in the array-of 
clause as anonymous temporaries. 
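
To make the independence of the loop bodies concrete, the following is a minimal Python sketch of what the generate function computes. It is an illustration only, not Sisal and not what OSC emits; note that Python lists are indexed from 0, whereas the Sisal arrays above run from 1 to n.

```python
def generate(n):
    """Return two n-by-n nested lists, mirroring the two arrays of the Sisal example."""
    # Each (i, j) body depends only on its own indices, so the bodies
    # could be evaluated in any order, or in parallel.
    products = [[float(i) * float(j) for j in range(1, n + 1)]
                for i in range(1, n + 1)]
    quotients = [[float(i) / float(j) for j in range(1, n + 1)]
                 for i in range(1, n + 1)]
    return products, quotients


p, q = generate(3)
print(p[1][2], q[1][2])  # the entries for i = 2, j = 3
```

No name is ever reassigned here; as with t1 and t2 in the Sisal loop, each value is defined once per (i, j) pair.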

Further syntactic elements of the Sisal language include let-in statements,
which allow for name definition and use; if-statements, which allow
conditional name definition; record and union types, which allow for flexible
data aggregation; streams, which allow for producer-consumer computations; and
sequential loops, which allow true iteration, with specified data dependencies
existing between iterations. I/O in Sisal is performed by passing inputs as
arguments to, and receiving outputs as the results of, the outermost
function. The values used for inputs and returned as outputs obey a syntax
called Fibre, which allows the demarcation of dynamically sized aggregates.

The Optimizing Sisal Compiler translates source programs into executable 
memory images, including the runtime system components required to au- 
tomatically manage memory, tasking, and I/O. The amount of parallelism 
to be exploited by a program can be controlled by user options, and once 
compiled, a program can be executed by any number of worker processes, by 
way of a single runtime parameter. Similarly, compiler optimization behavior 
and runtime performance can be observed and controlled by options applied 
at various points during compilation and execution.

3. An Early Implementation: The Optimizing Sisal Compiler

Early implementations of the Sisal language were basic proofs of concept.
Various interpreters have been implemented, such as DI [36], TWINE [23], and
SSI [24], but for the greatest execution speed, compiled code was needed.
Sisal was ported to novel architectures like the Manchester Dataflow Machine
[7], but complete acceptance for the language required porting to the newly
emerging shared memory parallel machines then coming to market (HEP [1],
Encore, Sequent [27], Cray).



3.1 Update in Place and Copy Elimination 

Several key obstacles emerged. The first of these came from Sisal's semantic
concept of making copies to preserve single assignment and referential
transparency. A fragment like

let
  A := array [1: 1,2,3];
  B := A[2: 999];
in
  A, B
end let

returned [1: 1,2,3], [1: 1,999,3]. The original value of A had to be preserved, 
so a copy was made to enable the replacement. 

Consider instead swapping two elements of an array. In Sisal, one would 
write something like: 

C := array [1: 1,2,3,4,5];

% Read as: E is identical to C,
% except index 3 has C[4] in it
% and index 4 has C[3] in it
E := C[3: C[4]; 4: C[3]];

The semantics (not the implementation) of Sisal calls for making a copy of
the array to make the first replacement, and making another copy for the
second replacement. A FORTRAN programmer would never do this; instead,
they would write:

ITEMP = IARRAY(3)
IARRAY(3) = IARRAY(4)
IARRAY(4) = ITEMP



This program makes no (array) copies and is done in place. Similarly, many
in-place algorithms that are efficient in space and time have been designed in
imperative languages that would run poorly if all data structures had to be
copied. Clearly there was room for improvement [38]. Consider the original
Sisal swap broken down into individual steps:

C := array [1: 1,2,3,4,5];
T0 := C[3];
T1 := C[4];
D := C[3: T1];
E := D[4: T0];

When the array replacement is done on line 3 (the definition of D), the value
in C is dead, so once the replacement is done the C value can be thrown away.
Instead of making a copy for D and throwing C away, we can safely use C's
container instead. Similarly, the second replacement on line 4 (the definition
of E) is the last use of D, so E can use D's container. This analysis is
simplified by the functional semantics of Sisal. The Optimizing Sisal Compiler
(OSC) makes heavy use of update-in-place and copy elimination analysis to
eliminate many unnecessary copies [9]. In its simplest sense, update-in-place
migrates reader operations before writers. Here, the C[3] and C[4] readers
were moved before the replacement operations in order to maximize the chances
that C would be a dead value for the update.
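
The idea can be sketched in a few lines of Python. This is an illustration under our own assumptions, not OSC's implementation: the is_last_use flag stands in for the result of the compiler's liveness analysis, which in OSC is computed at compile time rather than passed at run time.

```python
def replace(array, index, value, is_last_use):
    """Functional array replacement; reuses the container when the input is dead."""
    # Copy only when the old value is still live elsewhere.
    target = array if is_last_use else list(array)
    target[index - 1] = value  # the Sisal arrays in the text start at index 1
    return target


c = [1, 2, 3, 4, 5]
t0 = c[3 - 1]  # readers are hoisted before the writers
t1 = c[4 - 1]
d = replace(c, 3, t1, is_last_use=True)  # C is dead: D takes over C's container
e = replace(d, 4, t0, is_last_use=True)  # D is dead: E takes over D's container
print(e, e is c)
```

With is_last_use=False the function behaves like the semantic definition and leaves its input untouched, which is the case for A and B in the earlier let-in example.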

3.2 Build in Place 

Other important optimizations had to be developed [30]. Many functional
programs work on pieces of a large structure and then glue the
computed fragments together. For instance:

L := F(0,A[1],A[2]);
R := F(A[N-1],A[N],0);
III := for i in 2,n-1 ...
LIII := array_addl(III,L);
LIIIR := array_addh(LIII,R);

Semantically, this says: 1) build piece L, 2) build piece R, 3) build the size
n-2 array III, 4) allocate LIII, a size n-1 array, 5) copy L and III into
LIII, 6) allocate LIIIR, a size n array, and 7) copy LIII and R into LIIIR.
This seems to require two allocations and two large data copies. OSC
introduced the idea of buffers and persistent memory to the backend of the
compiler, leaving the frontend unchanged. Using a buffer system, the same
operation proceeds as follows:

Build Buffer LIIIR of size n 
Compute L and put in LIIIR [1] 

Compute R and put in LIIIR [n] 

Compute III in LIIIR [2] . . .LIIIR [n-1] 







This trick can be played even if the left and right pieces are loops. The
beauty of this build-in-place system is that memory can be preallocated
and parallel computations can simply store values where they belong, even if
the pieces originate in distant parts of the computation.
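
A rough Python sketch of the buffer scheme follows. The source elides the body of the interior loop, so the function g below is a hypothetical placeholder, as is the choice of f in the usage example; the sketch also assumes n >= 2.

```python
def build_in_place(a, f, g):
    """a is a 0-based list standing in for A[1..n]; f builds the end pieces, g the interior."""
    n = len(a)
    out = [None] * n                        # one preallocated buffer for LIIIR
    out[0] = f(0, a[0], a[1])               # L goes straight into slot 1
    out[n - 1] = f(a[n - 2], a[n - 1], 0)   # R goes straight into slot n
    for i in range(1, n - 1):               # interior iterations are independent
        out[i] = g(a, i)
    return out


def f(x, y, z):          # hypothetical end-piece function
    return x + y + z


def g(a, i):             # hypothetical interior body
    return a[i - 1] + a[i] + a[i + 1]


print(build_in_place([1, 2, 3, 4], f, g))
```

Each producer writes directly into its final position, so no intermediate array is allocated and nothing is copied when the pieces are joined.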

3.3 Reference Counting Optimization 

We have seen that we can take advantage of an object ending its life just as
we would otherwise need to copy it. Reference counts were introduced to
determine 1) when a value can be updated in place and 2) when a value's memory
can be recycled. Reference counting can be a very expensive operation on
sequential machines; on parallel machines it is much worse. Parallel reference
counts must be updated in a critical section, so the program acquires locks
every few operations, swamping the machine. Luckily, programs tend to have
simple patterns of use for aggregate values, and OSC can eliminate nearly all
reference counting in a program through lifetime analysis and operation
merging [35].
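
As a minimal sketch of how a reference count can guard update in place, consider the following Python fragment. The Value class and its fields are our own illustrative assumptions, not OSC's runtime structures, and the sketch is sequential; in a parallel runtime each change to refs would have to happen inside a critical section, which is the cost described above.

```python
class Value:
    """A toy aggregate carrying an explicit reference count (hypothetical design)."""

    def __init__(self, data):
        self.data = data
        self.refs = 1


def replace(value, index, item):
    """Update in place when we hold the only reference; otherwise copy."""
    if value.refs == 1:
        value.data[index] = item
        return value
    value.refs -= 1                    # give up our reference to the shared value
    fresh = Value(list(value.data))    # private copy with a count of one
    fresh.data[index] = item
    return fresh


v = Value([1, 2, 3])
w = replace(v, 1, 999)
print(w is v, w.data)
```

When lifetime analysis proves at compile time that a count must be one at a given point, the test and the count updates can be dropped altogether.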

3.4 Vectorization 

On vector machines, all the speed advantages come from routing array
operations through temporary vector registers. OSC has fine control of loop
placement so that reader/writer chains can be established. In imperative
languages, this generally requires very careful writing of loops in order to
clearly establish vector relationships between loops. The semantics of Sisal's
underlying dataflow representation make loops easy to move and so OSC can
vectorize extremely well [11].

3.5 Loop Fusion, Double Buffering Pointer Swap, and Inversion

On scalar and scalar/parallel machines, loop overhead and memory fetch time
tend to dominate computations. OSC can accommodate these machines by
applying aggressive loop fusion. Fusion can rewrite loop code like 

T0 := for i in 1,n returns
        array of A[i]*2
      end for;

T1 := for i in 1,n returns
        array of B[i]*3
      end for;

X := for i in 1,n returns
       array of T0[i] + T1[i]
     end for;

into

X := for i in 1,n returns
       array of A[i]*2 + B[i]*3
     end for;

eliminating the generation of two temporary arrays and setting up values that
can stream into internal registers from the cache. FORTRAN90 has similar
semantics for its array operations: X = A*2 + B*3. Sisal's OSC compiler
can implement this more efficiently than a FORTRAN90 compiler because
the FORTRAN compiler must know absolutely that neither A nor B is aliased to
X. Sisal's functional semantics ensure that the left-hand side of a definition
is never an alias for a right-hand side.
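
The effect of fusion can be sketched in Python as two functions that compute the same result; the names unfused and fused are ours. This is only an illustration of the transformation, not of the code OSC generates.

```python
def unfused(a, b):
    """Three loops and two temporary arrays, as in the original formulation."""
    t0 = [x * 2 for x in a]
    t1 = [y * 3 for y in b]
    return [u + v for u, v in zip(t0, t1)]


def fused(a, b):
    """One loop and no temporaries; each element is computed in a single pass."""
    return [x * 2 + y * 3 for x, y in zip(a, b)]


print(unfused([1, 2], [3, 4]), fused([1, 2], [3, 4]))
```

Because the result is always a fresh value that cannot alias a or b, the two forms are interchangeable without any alias analysis.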

A typical scienti c computation proceeds as follows: 

for initial
  A := start_values()
while not done(A) repeat
  A := time_step(old A)



Here, a new version of A of the same size is generated at each time step. A
naive implementation of Sisal would allocate a new buffer for each time step
and throw away the old one even though it was the right size. OSC notices this,
allocates a second buffer once outside the loop, and pointer swaps the original
and secondary buffers at each step.
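
A minimal Python sketch of the pointer swap follows. The function names and the convention that the time step writes every slot of a destination buffer are assumptions made for the illustration.

```python
def iterate(start, time_step_into, done):
    """Run time steps using two preallocated buffers that trade roles each step."""
    current = list(start)
    spare = [0.0] * len(start)            # allocated once, outside the loop
    while not done(current):
        time_step_into(current, spare)    # read `current`, write every slot of `spare`
        current, spare = spare, current   # pointer swap: no allocation, no copy
    return current


def halve(src, dst):                      # hypothetical time step
    for k, x in enumerate(src):
        dst[k] = x / 2.0


print(iterate([8.0, 4.0], halve, lambda a: max(a) < 1.0))
```

Only two buffers exist for the whole run, regardless of the number of time steps.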

Consider a 1D smoothing function that averages values using a three-point
stencil, X[i] = (A[i-1] + A[i] + A[i+1])/3.0. At the endpoints a two-point
stencil is used instead. This is most easily expressed as:

X := for i in 1,n
       v := if i = 1 then (A[1] + A[2]) / 2.0
            elseif i = n then (A[n-1] + A[n]) / 2.0
            else (A[i-1] + A[i] + A[i+1]) / 3.0
            end if
     returns array of v
     end for

The if-tests appear to introduce a large overhead and to inhibit parallelism,
vectorization and pipelining. The loop can be specialized by doing the
boundary computations separately from the inner computations. The inner
computation is simple:



inner := for i in 2, n-1 returns
           array of (A[i-1] + A[i] + A[i+1]) / 3.0
         end for

Now we just need to glue on the lower bound computation and the upper
bound computation. We need to be careful to handle zero-trip loops here!
This is done by producing an array for each of the boundary values. In the
zero-trip case, an empty array is generated. In all other cases, an array of
size 1 is generated.



leftBound := for i in 1, min(1,n) returns
               array of (A[i] + A[i+1]) / 2.0
             end for;

rghtBound := for i in max(2,n), n returns
               array of (A[i-1] + A[i]) / 2.0
             end for;

X := leftBound || inner || rghtBound;

The max/min function calls make sure the zero-trip cases are handled
gracefully. The final catenation puts the results in the correct form. The
catenations will actually be removed by build-in-place optimizations later in
the optimization process.
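
The specialization can be checked against the original loop with a small Python sketch. The function names are ours; the ranges keep the 1-based bounds of the Sisal text, so A[i] corresponds to a[i - 1]. As in the Sisal fragments, the sketch assumes n = 0 or n >= 2.

```python
def smooth_with_tests(a):
    """Original form: one loop with an if-test in every iteration."""
    n = len(a)
    out = []
    for i in range(n):
        if i == 0:
            out.append((a[0] + a[1]) / 2.0)
        elif i == n - 1:
            out.append((a[n - 2] + a[n - 1]) / 2.0)
        else:
            out.append((a[i - 1] + a[i] + a[i + 1]) / 3.0)
    return out


def smooth_specialized(a):
    """Specialized form: test-free inner loop plus two boundary loops."""
    n = len(a)
    # A Python range is empty when its start exceeds its end (a zero-trip loop).
    left = [(a[i - 1] + a[i]) / 2.0 for i in range(1, min(1, n) + 1)]
    inner = [(a[i - 2] + a[i - 1] + a[i]) / 3.0 for i in range(2, n)]
    right = [(a[i - 2] + a[i - 1]) / 2.0 for i in range(max(2, n), n + 1)]
    return left + inner + right           # the catenation of the three pieces


print(smooth_specialized([1.0, 2.0, 3.0, 4.0]))
```

The list catenation at the end is the step that build-in-place would remove by writing all three pieces into one preallocated buffer.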



4. Sisal90

The original Sisal definition has been extended and modernized. The new
language includes language-level support for complex values, array and vector
operations, static polymorphism and type-sets, higher order functions,
user-defined reductions, rectangular arrays, and an explicit interface to other
languages like FORTRAN and C. See [14] for a detailed description of Sisal90
and a comparison with Sisal 1.2.

An important objective was to enhance the language definition while
maintaining compatibility with the Sisal 1.2 definition. We could not delay
our application work waiting for the new definition and compilers, nor
disenfranchise the extant Sisal community. Additionally, enhancement rather
than overhaul implies fewer changes to the backend and permits us to reuse
existing software. The desired new features prompted a full rewrite of the
parser and a complete rethinking of how the low-level operations had to be
specified.

A second objective was to increase Sisal's appeal to scientific programmers.
To this end, we adopted Fortran 90 array operations where possible,
improved support for mixed-language programming, and included features,
perhaps not consistent with a strict interpretation of functional dogma, that
simplify the programmer's task. We do not believe that functional languages
can survive on their own; however, they can play a critical support role. Most
of the code in a large scientific application pertains to problem
specification, termination, I/O, and fault handling. These sections are not
functional and not parallel: write them in your favourite imperative
programming language. However, often the computational kernel is parallel and
functional. Here Sisal can play a crucial role, as it can reduce development
costs, ensure determinacy, and improve portability without sacrificing
performance. We perceive a gradual merging of the functional and imperative
programming communities where functional constructs form either a set of
language extensions or an integrated core. We hope that the Sisal90 definition
will accelerate this process.

4.1 The Foreign Language Interface 

Parallel programming traditionally involves the management of concurrent
tasks and machine resources in addition to the specification of the
computation, greatly increasing the programmer's burden. Most parallel
programs are not written for parallel execution from the outset. More often,
they begin as existing sequential programs, written in an imperative language,
and are augmented with parallel constructs from a vendor-specific enhanced
imperative language. The programmer who is assigned the task of parallelizing
such a code must preserve the semantics of the program; the parallel code must
execute efficiently on the parallel machine of choice and must exhibit some
scalability; the code should port easily to other parallel machines, in
particular new generations of the target machine; and development costs should
be kept as low as possible. As these goals are contradictory, there may be no
best solution. Since minimizing programming costs is an important objective,
the programmer usually identifies the most computationally intensive parts
of the code, and parallelizes only the parts that will provide the most gain
from parallel execution. By considering only these parallelizable sections, the
imperative programmer maximizes performance and minimizes development
costs. The Sisal language supports mixed language programming through its
Foreign Language Interface (FLI). The FLI allows Sisal programs to call or
be called from Fortran or C, and to invoke existing libraries or solvers. This
allows relatively easy recoding of the computational kernels of an existing
code for parallelism.

The use of the Sisal FLI involves four steps. First, the appropriate level
of parallelism, and the portion of the original code that contains it, must be
identified. The size of computational grains to be parallelized and the amount
of communication they will do must be considered. There may be one
identifiable code region which is appropriate for parallelization, or there
may be several, separated by sequential portions of the code. Second, the data
that must be communicated into and out of Sisal must be identified. This is
important, since Sisal's functional semantics require a strict separation of
inputs and outputs. The mere determination of the input and output data may
be nontrivial, since the imperative language may hide the data in global
variables, common blocks, and aliased variables, and their use as input,
output, or both, may be difficult to discern or may be situation dependent. The
third step is usually easy, once the first two have been achieved; it is the
translation of imperative source code into Sisal. While no automatic
mechanisms have been developed to do this, due to its dependency on human
intelligence and information gleaned from the first two steps, it can usually
be accomplished by a straightforward set of edits. The fourth step deals with
the data movement between Sisal and the imperative language, and the
initiation and termination of the Sisal Run Time System.

We will not address the first two steps, mentioned above, as they represent
an entire genre of know-how and experimentation by themselves. Step
three will require familiarity with both Sisal and the imperative language
code under consideration. Since most practically exploitable potential
parallelism in existing imperative codes will come from loops, they should be
examined first. Sisal does quite well at slicing its parallel loops. Other
sorts of parallelism, such as function parallelism (independent functions that
can potentially execute concurrently) and producer-consumer parallelism (e.g.
software pipelines) are not currently exploited by the Sisal compiler and Run
Time System, so they can be ignored. Once a loop has been identified as a
target for parallelism, it must be examined for inter-iteration dependencies.
These will inhibit parallelism, and must be eliminated from the parallel loops
that result from the translation step. Separate iterative loops may need to
be built in Sisal to handle these portions of the code. The imperative loops
will often have false dependencies in them arising from the reuse of
variables where no real data dependency is present. These can be eliminated by
the use of loop temporary names in Sisal. Imperative loops also often have
assignments with indexed array names as their targets; these must also be
eliminated, and can usually be rewritten with loop temporaries, given
appropriate index arithmetic. Once the programmer is used to dealing with
these exigencies, the translation process can be quick and easy. Following are
two code fragments illustrating these details.



      temp = 0.0
      Do 100 i = M, N
         temp = temp + A(i)
         B(i) = func( A, i )
         C(i) = C(i-1) + A(i)
         A(i) = A(i)*2.0
  100 continue

The first loop is in Fortran. Its inputs are a scalar, temp, an array, A, and
an array C; its outputs are the scalar temp, and arrays A, B, and C. Note
that array C has an index range apparently differing from those of A and B
by one: it has an element C(M-1), while A and B may not have an element
indexed less than M. The calculation of temp seems inherently sequential, but
in fact it can be accomplished in a parallel loop in Sisal. Here is a
translation of the above loop into Sisal:



temp, New_A, B, New_C :=
  for i in M, N
    A_sub_i     := A[i];
    B_sub_i     := foo( A, i );
    C_sub_i     := C[i-1] + A_sub_i;
    New_A_sub_i := A_sub_i * 2.0;
  returns value of sum A_sub_i
          array of New_A_sub_i
          array of B_sub_i
          array of C_sub_i
  end for



The syntactic differences should be obvious, as should the simplicity of
the translation between them. It should be noted that the Sisal fragment is
artificially lengthened by the presence of the simple expressions in the loop
body. In fact, all those expressions could be in the returns clause, which
would make the Sisal loop no longer than the Fortran version. However, we find
clarity to be more important than brevity in many cases where parallelism
is the goal and accuracy is at risk, so we tend to use loop temporaries, as
shown above, to help make Sisal code readable, and to err on the side of
readability where style is arguable.
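
The way the Fortran loop separates into independent pieces can be sketched in Python. This is our own illustration under stated assumptions: a holds A(M..N), c_before is the value of C(M-1) on entry, and the computation of B is omitted because func is not specified in the text. The sketch treats C(i) = C(i-1) + A(i) as the loop-carried recurrence it is in the Fortran code, handled by a separate sequential pass.

```python
from itertools import accumulate


def translate(a, c_before):
    """Compute temp, the new A, and the new C of the Fortran loop without mutation."""
    temp = sum(a)                    # a reduction: needs no sequential chain
    new_a = [x * 2.0 for x in a]     # element-wise: no indexed assignment target
    # Running sum seeded with C(M-1); the seed itself is dropped from the result.
    new_c = list(accumulate(a, initial=c_before))[1:]
    return temp, new_a, new_c


print(translate([1.0, 2.0, 3.0], 10.0))
```

The reduction and the element-wise doubling have no dependencies between iterations, which is why they can be sliced and run in parallel; only the running sum carries a true dependency.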

Step four involves building argument lists for the Sisal Run Time System
to use in invoking the outermost Sisal function, and return value lists
for the RTS to pass back to the invoking imperative code. Scalar data can
simply be passed in and returned without special effort, but arrays are more
complicated. Arrays in C and Fortran are contiguous blocks of primitive type
elements stored row-wise (in C) or column-wise (in Fortran). Arrays in Sisal
are vectors of scalars or vectors, which are contiguous only in the most
primitive dimension, and are stored row-wise. When passing an array between an
imperative program and a Sisal function, a descriptor must be provided in
addition to the array, that allows the Sisal Run Time System to correctly
handle the data and manage the storage it uses. Since all data items in Sisal
("values", as opposed to "variables" in imperative languages) are dynamic,
storage must be managed by the RTS. It does this well, and normally requires
no help from the programmer. Mixed language programming requires extra
effort in the form of array descriptors. Each array requires a descriptor, and
each descriptor contains fields for each dimension of the target array that
describe that dimension's physical and logical index range, whether the data
is read-only or writeable, and whether it must be transposed in passage.
Fortran arrays, for example, if of more than one dimension, must be transposed
by the RTS, since they are allocated in column major order in Fortran and
row major order in Sisal. The descriptors are themselves small arrays which
must be allocated in the imperative code, and which must be provided for
each array argument and result. The provision of this information can
conceivably be automated by the compiler, but at present it must be performed
manually by the programmer.
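
To illustrate the kind of information a descriptor carries and the transposition it can trigger, here is a small Python sketch. The Descriptor class and its field names are hypothetical and do not reproduce the FLI's actual descriptor layout, which the text does not give; only the logical index range is modelled.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Hypothetical stand-in for an FLI array descriptor (2-D case only)."""
    bounds: list       # one (lower, upper) logical index pair per dimension
    writable: bool     # read-only or writeable
    transpose: bool    # whether the data must be transposed in passage


def to_row_major(flat, desc):
    """Return a row-major copy of the 2-D array described by `desc`."""
    (r_lo, r_hi), (c_lo, c_hi) = desc.bounds
    rows, cols = r_hi - r_lo + 1, c_hi - c_lo + 1
    if not desc.transpose:
        return list(flat)
    # Column-major storage keeps element (r, c) at r + c*rows.
    return [flat[r + c * rows] for r in range(rows) for c in range(cols)]


# A 2-by-3 Fortran array [[1, 2, 3], [4, 5, 6]] as laid out in memory:
fortran_memory = [1, 4, 2, 5, 3, 6]
d = Descriptor(bounds=[(1, 2), (1, 3)], writable=False, transpose=True)
print(to_row_major(fortran_memory, d))
```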

In addition to the above, the Sisal Run Time System must be started
and stopped at points in the imperative program that are appropriate to the
parallel work that will be done. Normally, the RTS is started and stopped
automatically during the execution of a pure Sisal program, but this code
must be explicitly included and invoked in the loading and execution of a
hybrid program. Since it is expensive in terms of CPU time to do this, it
is not appropriate that it be done repeatedly within a loop that contains
calls to the Sisal code. Rather, the RTS should be started once, the Sisal
code invoked wherever appropriate, and the RTS should be shut down before
the normal termination of the program. It costs relatively little to leave the
RTS running between invocations of the Sisal code, so this method is not
particularly wasteful of machine resources. The RTS is started by a simple
call which contains a few of the parameters normally used in the execution
of a Sisal program. These include the program heap size (the memory pool
used by the RTS) and the number of worker processes to be used (the amount
of parallelism exploited).

At this point it is worth mentioning that the Sisal FLI was built as an
experiment, and as such is still in a somewhat rougher state than would be
desired in a production parallelization system. Its use, as documented above,
can present difficulties that can effectively undo some of the advantages of
applicative parallelism. For instance, the generation of the array argument
and result descriptors adds to the programmer's burden. In addition, arrays
of dimension greater than one will currently be copied across the FLI, a source
of overhead at execution time that is inimical to parallel performance goals.
Therefore, in the work we have done with it, we have routinely used aliasing
to hide the multidimensional nature of such arrays, and index arithmetic to
allow arbitrary access to their elements.
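
The index arithmetic involved can be sketched as follows, assuming the array originates in Fortran (column-major storage) and is passed as a one-dimensional array. The function name is ours.

```python
def flat_index(i, j, n_rows):
    """Offset of element (i, j), both 1-based, in a column-major flat array."""
    return (j - 1) * n_rows + (i - 1)


# The 2-by-3 array [[1, 2, 3], [4, 5, 6]] as a Fortran program stores it:
flat = [1, 4, 2, 5, 3, 6]
print(flat[flat_index(2, 3, n_rows=2)])  # element (2, 3)
```

Since the array is seen as one-dimensional on both sides of the interface, no transposition or copy is needed when it crosses.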

In addition, we must confess that the goals of machine independence are
not always met in parallel programming, and this is at least as true in mixed
language programming for parallelism. It sometimes happens that the Sisal
code resulting from the translation of step three, above, must be modified for
performance purposes. For instance, column-wise accesses to two-dimensional
arrays usually cause performance degradation in systems containing cache
memories. However, this by itself is usually no more serious a constraint than
would be imposed by such system architectures during a parallel port in any
other language.

Notwithstanding the problems mentioned above, we believe the FLI offers
two distinct advantages to the parallel programmer. First, it provides
a means of rapidly parallelizing existing application codes by concentrating
programmer effort where it will provide the best return. Second, it offers
a developmental path for codes ranging from experimentation on cheap
workstations to production on expensive supercomputers.



5. A Prototype Distributed-Memory SISAL Compiler 

In this section we present D-OSC, a prototype SISAL compiler for
distributed-memory machines. D-OSC is an extension of OSC [12]. A new analysis
phase for loop and array distribution has been added and the code generation
phase has been modified to produce C plus MPI [15] calls. The run-time system
has been modified to support array distribution and communicating threads.
Information needed to perform distributed memory optimizations is established
by the analysis phase and provided to the code generator by decorating the
appropriate IF2 nodes and edges.

The D-OSC model of execution is activation-based. A master process 
is responsible for dividing parallel loops into slices which will be executed 
by slave processes running in parallel. A slice is represented by an activation 
record, which contains a code pointer, the loop range, a unique loop identifier,
input parameters to the slice, and destinations for values to be returned upon 
termination. Activation records are distributed over the machine and each 
processor maintains a local activation record queue. Upon completion of a 
slice, the slave process sends a completion message to the master and updates 
global results with locally-computed values. As a slice may contain a parallel 
loop, each slave can become a master and distribute its inner loop. Each 
processor must be able to receive a request for service from other processors, 
such as a read, write or allocate request. This is achieved by having a listener 
thread always active on every processor. 
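
As a rough sketch of how a master might divide a parallel loop into slices, consider the Python fragment below. The Activation record is deliberately reduced to a loop identifier and an inclusive range; the code pointer, input parameters, and result destinations mentioned above are omitted, and the near-equal split is our assumption rather than D-OSC's documented policy.

```python
from dataclasses import dataclass


@dataclass
class Activation:
    """Hypothetical, reduced activation record for one loop slice."""
    loop_id: int
    lo: int
    hi: int  # inclusive upper bound, as in a Sisal range


def make_slices(loop_id, lo, hi, workers):
    """Split the inclusive range lo..hi into at most `workers` near-equal slices."""
    total = hi - lo + 1
    slices, start = [], lo
    for w in range(workers):
        # The first (total % workers) slices receive one extra iteration.
        size = total // workers + (1 if w < total % workers else 0)
        if size > 0:
            slices.append(Activation(loop_id, start, start + size - 1))
            start += size
    return slices


for record in make_slices(loop_id=7, lo=1, hi=10, workers=4):
    print(record)
```

A slave that finds a nested parallel loop inside its slice could call the same routine on the inner range, which corresponds to a slave becoming a master.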

D-OSC is implemented in four phases, where each phase relies on the 
previous one. 

- Base. This phase employs no analysis whatsoever, hence the code generated
  is very naive. Arrays and loops are distributed equally among processors.
  Message passing is used to access remote array elements. This compiler
  version serves as a reference for further implementations, providing useful
  information about the effectiveness of certain optimizations.

- Rectangular Arrays. The standard implementation of higher-dimensional
  arrays as arrays of arrays is replaced, where possible, by rectangular arrays
  with a single descriptor. Arrays and the loops creating or using arrays can
  be distributed by rows, blocks or columns. Not all loops are distributed.

- Block Messages. The reading and writing of remote array elements within
  certain loops is optimized by combining all the messages directed to the
  same processor into a single block message.

- Multiple Alignment. In previous phases array partitioning created disjoint
  sections of an array. In this phase overlapping array sections are created.
  This optimization reduces the number of messages passed, at the cost of
  using more space for the overlapping array sections.



5.1 Base Compiler 

In OSC, the representation of arrays consists of an array descriptor, which
contains information such as bounds, reference count, size, and other
information, and a pointer to the physical array. OSC assumes a shared-memory
model, and the pointers to the array descriptor provide a unique array
identifier. An evident problem on a distributed-memory machine is that the
descriptor pointer cannot be used as a unique identifier, since the address of
the array descriptor is different for each processor. Hence a unique array
identifier is created explicitly as the index in an array table that exists on
each processor. The design of the array table permits a great deal of
compatibility with existing array operations since the OSC concept of a unique
array descriptor is preserved.

Arrays are partitioned according to the distribution of the creating loop.
In the Base compiler each loop, and hence each array dimension, is distributed
equally among processors. To create the unique identifier for distributed
arrays, the master process that creates the loop slices allocates the array
identifier and sends it as part of the activation message to the slaves. Each
slave then executes a slice in parallel and updates its local entry in the
array table.

Array access in the base compiler is straightforward. The processor that
owns the array element is determined. In the base case this amounts to a
simple computation involving the array size and the number of processors. If
the owner is the local processor, the array element is read directly from local
memory; otherwise a request message is sent to the listener thread of the
processor that owns the array element. The listener thread directly performs
the array access.
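
The owner computation and the local/remote decision can be sketched in Python as below. The exact formula used by D-OSC is not given in the text, so the equal-block rule here is an assumption, as are the function names; send_request stands in for a message to the owner's listener thread. The sketch assumes a non-empty array and 0-based indices.

```python
def owner(index, size, procs):
    """Processor owning 0-based `index` when `size` elements are split into equal blocks."""
    block = -(-size // procs)             # ceiling division: elements per processor
    return index // block


def read_element(index, size, procs, me, local_block, send_request):
    """Read locally when this processor is the owner; otherwise ask the owner."""
    who = owner(index, size, procs)
    if who == me:
        block = -(-size // procs)
        return local_block[index - me * block]
    return send_request(who, index)


# Ten elements valued 10..19 on four processors; processor 1 holds indices 3..5.
print(read_element(4, 10, 4, me=1, local_block=[13, 14, 15],
                   send_request=lambda proc, idx: ("remote", proc, idx)))
```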



5.2 Rectangular Arrays 

Rectangular arrays have only one descriptor per array, regardless of its
dimensionality. Only one, possibly remote, memory access is needed to fetch an
array element, whereas an array-of-arrays implementation of an nD array
requires n memory accesses to fetch an element. With one array descriptor per
array, traditional distributions, such as row, block and column, are easier to
implement. A disadvantage of rectangular arrays is that sub-arrays cannot be
shared. However, sharing also has disadvantages since update-in-place cannot
be performed, and access functions are less efficient. Another disadvantage
of rectangular arrays is that ragged arrays cannot be represented.
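
The difference between the two representations can be sketched in Python. The function rect_offset and the use of a tuple of extents as the single "descriptor" are illustrative assumptions, not the D-OSC data structures.

```python
def rect_offset(indices, extents):
    """Row-major offset of a 0-based index tuple in a rectangular array."""
    offset = 0
    for idx, extent in zip(indices, extents):
        offset = offset * extent + idx    # accumulate one dimension at a time
    return offset


extents = (2, 3, 4)
flat = list(range(24))                    # one contiguous block, one descriptor
nested = [[[flat[rect_offset((i, j, k), extents)] for k in range(4)]
           for j in range(3)]
          for i in range(2)]              # arrays of arrays holding the same values

# nested[1][0][2] follows three references; the flat form needs one indexed access.
print(nested[1][0][2], flat[rect_offset((1, 0, 2), extents)])
```

On a distributed-memory machine each of the three references in the nested form could be remote, which is what makes the single-access form attractive.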

Arrays are created using IF2 AGather nodes. Consider the case of a Sisal
triple cross product for loop that returns a three-dimensional array. In the
original IF2, AGather nodes in the result graphs of all three nested loops
create arrays. In the rectangular array case, the actions that the various
AGather nodes perform are different. The outermost AGather node must
perform the allocation of the physical space for the whole 3D array, and the
allocation of the single array descriptor. The innermost AGather node fills
in the elements of the array. The AGather node in the middle loop does not
perform any action.

In the original IF2 an array-of-arrays access consists of multiple AElement
nodes scattered over the dependence graph, each with one index input. For
a rectangular array this must be transformed into one AElement node with
all indices as input. The analysis phase identifies the tree of AElement nodes
that is spanned by the output edge of a root AElement node and marks
these nodes with information such as the level of the node in the tree and
back-edges to ancestor nodes.



5.3 Block Messages 

The implementation of array access operations described above is not always
efficient for array references in loop bodies, as performing remote exchanges
for individual elements is less efficient than performing at most one block
exchange per producer-consumer processor pair. Our algorithm for obtaining
block messages is a modification of the algorithm presented in [16].
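
The saving can be illustrated with a small C++ sketch that only counts messages. It is not the algorithm of [16]; the block distribution and the names owner, element_messages and block_messages are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Owner of element idx under an assumed block distribution of n elements
// over p processors (each processor owns ceil(n/p) consecutive elements).
inline std::size_t owner(std::size_t idx, std::size_t n, std::size_t p) {
    assert(p > 0 && idx < n);
    const std::size_t block = (n + p - 1) / p;
    return idx / block;
}

// Messages needed when every remote element is requested individually.
inline std::size_t element_messages(const std::vector<std::size_t>& refs,
                                    std::size_t me, std::size_t n, std::size_t p) {
    std::size_t count = 0;
    for (std::size_t idx : refs)
        if (owner(idx, n, p) != me) ++count;
    return count;
}

// Messages needed when all elements from the same producer travel in one
// block: at most one per producer-consumer processor pair.
inline std::size_t block_messages(const std::vector<std::size_t>& refs,
                                  std::size_t me, std::size_t n, std::size_t p) {
    std::set<std::size_t> producers;
    for (std::size_t idx : refs) {
        const std::size_t o = owner(idx, n, p);
        if (o != me) producers.insert(o);
    }
    return producers.size();
}
```

For a consumer on processor 0 that references elements 2 through 9 of a 16-element array distributed over four processors, the sketch counts six element messages but only two block messages.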

5.4 Multiple Alignment 

The last phase of the compiler implements the overlapping allocation of array
sections presented in [17] for one-dimensional arrays. Overlapping allocation
is applied to loops with restricted affine references as in the following
loop model, where the c_j are constants.

    for i in lo, hi
    returns array of f(B1[i+c1], ..., Bm[i+cm])
    end for

In the case of single alignment, i.e. m = 1, the first element of the
consumer array is aligned with element 1 + c1 of the producer array. For the
general case, the analysis phase identifies restricted affine loops that
create one-dimensional arrays while accessing elements from other
one-dimensional arrays. Multiple alignment is achieved by identifying all
the unaligned references required, and the maximum and minimum offsets of
these with respect to the consumer index. The contiguous set of indices thus
obtained is a superset of the producer array elements needed. Loops are
marked RightOverlap and LeftOverlap to be used in the code generation phase
to determine the upper and lower bounds for each slice.
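
A minimal C++ sketch of this analysis follows. The structure names and the interpretation of the two marks (a negative minimum offset sets the left mark, a positive maximum offset sets the right mark) are our assumptions for the illustration, not a description of the D-OSC code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Result of analysing one restricted affine loop (hypothetical structure).
struct Overlap {
    int min_offset;      // smallest c_j
    int max_offset;      // largest c_j
    bool left_overlap;   // assumption: elements below the slice are needed
    bool right_overlap;  // assumption: elements above the slice are needed
};

inline Overlap analyse(const std::vector<int>& offsets) {
    assert(!offsets.empty());
    auto mm = std::minmax_element(offsets.begin(), offsets.end());
    const int lo = *mm.first;
    const int hi = *mm.second;
    return Overlap{lo, hi, lo < 0, hi > 0};
}

struct Range { int first, last; };  // inclusive index range

// Contiguous range of producer indices covering every reference made by the
// consumer slice; it is a superset of the elements actually needed.
inline Range needed_range(Range slice, const Overlap& ov) {
    return Range{slice.first + ov.min_offset, slice.last + ov.max_offset};
}
```

With offsets -1, 0 and 2, a consumer slice covering indices 10 to 19 needs producer indices 9 to 21, so both marks are set.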



5.5 Results 

The benchmark programs used here to assess the effectiveness of the various
optimization phases are Livermore loops 1, 2, 3, 6, 7, 9, 12, 21, and 24, run
on a network of four workstations. Since the initial objective is to reduce
communication, we measure the total number of messages exchanged (the
first number in each entry of Table 5.1) and the total volume of
communication (the second number in each entry).



Table 5.1. Number of Messages, Communication Volume (4 PEs). 



Program  Type    Base           Rect Arrays    Block Messages  Multiple Alignment

ll1      1D      6605, 132132   6603, 211368   603, 31368      303, 12168
ll2      1D      6443, 126656   6443, 213128   6443, 213128    6443, 213128
ll3      1D      3, 96          3, 168         3, 168          3, 168
ll6      1D, 2D  10533, 213036  13223, 430408  13223, 430408   13223, 430408
ll7      1D      4807, 86568    7503, 225168   953, 18968      303, 8568
ll9      2D      5883, 117136   2403, 76968    603, 28968      603, 28968
ll12     1D      9005, 180132   3003, 96168    1503, 24168     3, 168
ll21     2D      471, 8520      14403, 460968  123, 58728      123, 58728
ll24     1D      29703, 594096  29703, 950568  29703, 950568   29703, 950568


Rectangular arrays decrease the number of messages exchanged for some
of the programs that use 2-D arrays. However, sometimes the number of
messages increases, as in loop 21. The reason for this is that the partitioning
of the loops and arrays performed by the base compiler matches the accesses
of the array elements better than the rectangular arrays implementation.

Most of the programs that access arrays benefit greatly from the
implementation of block messages. The greatest improvements occur for loops 1
and 21. Loops 2 and 24 are sequential and the current implementation only
generates block messages for references accessed in parallel loops. Loop 6
contains subscript expressions that use non-loop variables.

Multiple alignment reduces the number of messages for the programs with
producer-consumer relations of one-dimensional arrays, such as loops 1, 7
and 12.

The volume of communication does not always decrease and varies with
program characteristics. In loop 24, where the number of messages exchanged
remains the same for all the compiler phases, the volume of communication
increases. This is because the implementation of rectangular arrays increases
the size of messages required to access array elements in order to
accommodate the multiple indices of rectangular arrays.



5.6 Further Work

D-OSC is a prototype implementation that helps us to quantify compiler
optimizations for distributed-memory machines. The following are some of the
tasks that must be performed to improve D-OSC. A more efficient run-time
system is needed. There are situations where run-time reference counting is
necessary. If one processor owns a reference count, each remote processor that
updates the reference counter must contact this processor. When deallocating
an array, the responsible processor must notify all processors that have
partial copies of the array to deallocate the space. The implementation of
function call parallelism is very easy under the activation-based model.
However, inter-functional analysis is required to determine when and where to
spawn functions. Currently loops are always distributed over all processors.
If an analysis phase can estimate the computation cost of a loop body, then
it is possible to generate code that decides the number of processors to be
used. Finally, parallel I/O must be implemented.



6. Architecture Support for Multithreaded Execution 

Multithreaded execution has been proposed as a model for parallel program
execution. As a model, or rather a family of models, multithreading views
a program as a collection of concurrently executing sequential threads that
are asynchronously scheduled based on the availability of data. This
definition is intentionally wide in that it attempts to capture the common
features among various multithreaded execution models proposed to date. It is
important to note that in this definition the multithreaded execution model
does not specify any form of memory hierarchy (it is common though to expect
a single logical address space, shared by many threads and mapped over
several nodes), any specific language feature, whether threads are user
specified or compiler generated, the mechanism for communication and/or
synchronization among threads, or the order of thread execution. There is no
standard definition of a thread. In this document we will define a thread as
the set of sequential instructions executed between two synchronization
points. Note that this definition does not preclude any architecture from
exploiting the instruction level parallelism within a thread or the locality
of access to a storage hierarchy.

Because of its functional properties, the Sisal language is particularly well
suited as a source for multithreaded code. In this section we present some
results related to the evaluation of multithreaded execution. The performance
of multithreaded execution is determined by the complex interaction of a
number of inter-related architectural and compilation issues such as code
generation, thread firing rules, synchronization schemes and thread
scheduling. The relation between these issues and the tradeoffs between
various alternatives for each of these issues is complex and requires
extensive experimental evaluation. For example, the thread firing rule (which
determines when threads are enabled) can be based on either a blocking or a
non-blocking strategy. The blocking strategy is adopted in Iannucci's Hybrid
Architecture [20], the Tera MTA [2] and the EARTH machine [19]. The
non-blocking strategy is adopted in Monsoon [28,29], *T [25] and the
EM-4 [34] among others. The Threaded Abstract Machine (TAM) [13] is a
software implemented multithreaded execution model that has been ported to a
number of platforms (such as the TMC CM-5 and the Cray T3D); it implements
the non-blocking model.

In this section we summarize the results of an experimental and quantitative
evaluation of these two execution models. The evaluation includes their
respective code generation strategies, the implications of these strategies
for data distribution and access, and the performance of their respective
storage hierarchies.

6.1 Blocking and Non-blocking Models 

The two multithreaded execution models considered here are based on data- 
driven dynamic execution with statically generated threads. This section 
presents a detailed description of these two models. 

The blocking thread execution model. In this model a thread may be
suspended and its execution resumed later. This model requires the underlying
architecture to support context switching, i.e., the saving of the thread
state and the selection of a new thread. Usually, a thread is suspended after
initiating a long latency operation such as a remote memory access.

In this model the synchronization and storage mechanisms rely on the
Frame model: a frame represents a storage segment associated with each
invocation of a code-block¹. The Frame model is used in several multithreaded
machines (e.g. TAM [13], StarT-N [3] and the EM-4 and EM-X [21]). All the
threads within the code-block instance refer to its associated frame to store
and load data values. Frames are of variable size and contiguously allocated in
the virtual address space. The size of a frame is determined by the maximum
number of data values associated with the code-block. When an instance of a
particular code-block is invoked, a frame is first allocated in local memory of
a processor and all the data values generated within that code-block instance
will be stored in that frame. The virtual address carried by a token is of the
form:

    <frame pointer, frame offset>

A synchronization slot in the frame is associated with each thread. The
synchronization slot is a counter initialized with the count of the number of
the inputs to the thread and is decremented with the arrival of each input.
The thread is ready when the count reaches zero. A data value that is shared
(i.e. read) by several threads in the same frame occupies only one location.
Data values generated by the executing threads are sent to the
Synchronization Unit, which writes them in the frame. The frame is deallocated
when all the threads in the code-block have terminated.

¹ A code-block is a semantically distinguishable unit of code such as a loop
or function body.

The non-blocking thread execution model. In this model a thread is
activated only when all its input parameters are available. Therefore, once a
thread starts its execution it runs until termination. All memory accesses are
performed as split-phase accesses: the request is issued by a thread but the
result is returned to another thread. In this model the thread never has to
block, and be switched out, while waiting for a remote memory access.

The synchronization and storage mechanism for the non-blocking threads
is the Framelet model. A framelet is a fixed-size unit of storage that is
associated with each thread instance. Each framelet has one synchronization
slot for that thread instance. In the Framelet model a data value that is
shared among several threads within the same code-block is replicated
in the framelet of each thread instance. The framelet is deallocated when the
thread instance completes its execution. Because their size is fixed,
framelets are aligned with cache blocks. The virtual address of a data value
in the Framelet model is of the form:

    <context #, thread framelet offset>

Example. A code-block consisting of four threads is shown in Figure 6.1.
The corresponding Frame memory model is shown in Figure 6.2. The
input a, which is used by both threads A and B, is stored at only one place
in the frame memory. Each of the values in the frame memory is accessed by
the frame base address and the offset into the frame. The first four slots are
the counters for the threads. Thus when value c is stored only the counter
for D is decremented. When a is stored, however, the counters for both A and
B are decremented but only one copy of a is stored in the frame.

The Framelet memory model corresponding to the same code block is
shown in Figure 6.3. There are four separate framelets. Each framelet
contains the counter for the corresponding thread and a memory location for
all the inputs to the thread. Hence framelet A corresponds to one particular
activation of thread A. The value a is stored in the framelets of both
threads A and B and both counters are decremented. This is accomplished as
two separate store operations.
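
The difference between the two storage models for the shared input a can be sketched in C++ as follows. The classes Frame, Framelet and FrameletStore are hypothetical, and since the figures are not reproduced here the input counts used in any example are assumptions; the sketch only shows that the Frame model performs one store and two counter updates, while the Framelet model performs two stores.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Frame model: one frame per code-block instance. Each thread has a counter
// (its synchronization slot) and a shared value occupies a single location.
struct Frame {
    std::map<char, int> counter;           // thread -> inputs still missing
    std::map<std::string, double> values;  // one location per data value
    int stores = 0;                        // number of store operations

    // Store the value once, then decrement the counter of every consumer.
    void store(const std::string& name, double v,
               const std::vector<char>& consumers) {
        values[name] = v;
        ++stores;
        for (char t : consumers) {
            assert(counter.at(t) > 0);
            --counter.at(t);
        }
    }
    bool ready(char t) const { return counter.at(t) == 0; }
};

// Framelet model: one framelet per thread instance, holding its own counter
// and a private copy of each of its inputs.
struct Framelet {
    int counter;
    std::map<std::string, double> inputs;
};

struct FrameletStore {
    std::map<char, Framelet> framelets;
    int stores = 0;  // number of store operations

    // A shared value is replicated: one separate store per consumer thread.
    void store(const std::string& name, double v,
               const std::vector<char>& consumers) {
        for (char t : consumers) {
            Framelet& f = framelets.at(t);
            assert(f.counter > 0);
            f.inputs[name] = v;
            --f.counter;
            ++stores;
        }
    }
    bool ready(char t) const { return framelets.at(t).counter == 0; }
};
```

Storing a for consumers A and B costs one store in the Frame sketch and two in the Framelet sketch, while both leave the two counters decremented.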

Fig. 6.1. Code block with Three threads.

Fig. 6.2. Frame Memory Representation.

Fig. 6.3. Framelet Memory Representation. (Panels: Framelet A, Framelet B,
Framelet C and Framelet D, each listed by address.)



6.2 Code Generation

The source language used for the generation of multithreaded code is Sisal.
The compilation process converts the programs into two intermediate forms:
MIDC-2 (non-blocking) and MIDC-3 (blocking), which are both derived from
the Machine Independent Dataflow Code (MIDC) [33]. MIDC is a graph
structured intermediate format: the nodes of the graph correspond to
von Neumann sequences of instructions and the edges represent the transfer of
data between the nodes. MIDC has been used to generate the executable code
for other multithreaded machines (e.g. Monsoon and EM-4). Both MIDC-2
and MIDC-3 are highly optimized codes with optimization done both at the
inter- and intra-thread level.

The code generation compiler is guided by the following objectives [32]:

- Minimize synchronization overhead: by merging threads (thread fusion)
  and by allocating related threads to the same code block (in the blocking
  model).

- Maximize intra-thread locality: achieved by thread fusion.

- Assure deadlock-free threads: circular dependencies can create a potential
  for deadlock.

- Preserve functional and loop parallelism in programs.

The first phase of the code generation is the same for both models: it
involves compiling the Sisal programs to IF2 using OSC [10].

The second phase differs for the two models in the handling of structure
store accesses and the data storage models (frames or framelets). The
long latency operations consist of remote memory reads, memory allocations,
function calls and remote synchronizations. The remote memory references
can be handled either as a split-phase access or a single-phase access. In the
split-phase access the request is sent by one thread and the result is
forwarded to another thread. In the single-phase access the result is returned
to the same requesting thread. In the non-blocking model all remote accesses
are split-phase. The blocking model uses both types of accesses: the code
is analyzed at compile time to identify remote and local accesses. Remote
accesses are implemented by split-phase operations while local accesses are
regular memory accesses.

- In the non-blocking model (MIDC-2 form) all structure store accesses are
  turned into split-phase accesses. A split-phase access terminates a thread:
  the request is sent by a thread but the result is returned to another thread.
  In this model a thread never has to block on a remote memory access. This
  model does not make any assumption regarding data structure distribution.

- In the blocking model (MIDC-3) the IF2 graph is statically analyzed to
  differentiate between local and remote structure store accesses: a local
  access does not terminate a thread while a remote one does. If the result of
  a structure store access is used within the same code-block where the access
  request is generated, the access is considered local. In this case, the
  thread will block until the request is satisfied. This model relies on a
  static data distribution to enhance the locality of access. Note that a data
  structure is often generated in one code block and used in several others,
  in which case only one of the consumer code-blocks would have a local access.
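
The two kinds of access can be contrasted with a toy C++ structure store. The class StructureStore and its member names are hypothetical; a real machine would deliver replies over the network and through the synchronization hardware, which the sketch replaces with a simple queue of pending requests.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy structure store with a queue of outstanding split-phase requests.
class StructureStore {
public:
    explicit StructureStore(std::vector<int> cells) : cells_(std::move(cells)) {}

    // Single-phase access: the result returns to the requesting thread,
    // which conceptually blocks until the value is available.
    int read_blocking(std::size_t addr) const {
        assert(addr < cells_.size());
        return cells_[addr];
    }

    // Split-phase access: the requester only issues the request and names
    // the consumer thread (modelled as a continuation) that gets the result.
    void read_split_phase(std::size_t addr, std::function<void(int)> consumer) {
        assert(addr < cells_.size());
        pending_.emplace(addr, std::move(consumer));
    }

    // Later the replies arrive and enable the consumer threads.
    void deliver_replies() {
        while (!pending_.empty()) {
            auto req = std::move(pending_.front());
            pending_.pop();
            req.second(cells_[req.first]);
        }
    }

private:
    std::vector<int> cells_;
    std::queue<std::pair<std::size_t, std::function<void(int)>>> pending_;
};
```

In the split-phase case the requesting code returns immediately after read_split_phase, and the consumer runs only when deliver_replies hands it the value; in the single-phase case the caller of read_blocking receives the value itself.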

Example. The example in Figure 6.4 demonstrates the difference between
MIDC-2 and MIDC-3. In MIDC-2, Thread 255 performs a structure memory
read operation. The read is performed as a split-phase access where the result
is sent to Thread 256. Thread 255 does not block; it continues execution
until termination. When the result of the split-phase read is available it is
forwarded to Thread 256, which starts execution when all its input data is
available. There is no restriction on the processor on which Thread 255
and Thread 256 are executed.

In the MIDC-3 code, Threads 255 and 256 belong to the same code-block.
The read structure memory operation is a local single-phase operation. Hence,
the two threads become a single thread. The thread blocks when the read
operation is encountered and waits for the read request to be satisfied.

Fig. 6.4. MIDC-2 and MIDC-3 code examples. (Figure not reproduced. The
MIDC-2 panel annotates the split-phase read in thread 255, whose OUT
instruction sends a token to thread 256, port 2; thread 256 adds 1 to
input 2, which came from the split-phase read, and multiplies the result by
input 1, which came from the OUT of thread 255. The MIDC-3 panel annotates a
uniphase access on which the thread blocks; the result is available in R6,
and local register R3 is used instead of input 1 of the MIDC-2 form. The
surviving MIDC-3 instructions of code block N255 are: R5 = ADD R4,R3;
R6 = RSL R5,"2",R1; R2 = ADD R6,"1"; R3 = MUL R3,R2.)

Discussion of the Models. The main differences between the blocking and
non-blocking models lie in their synchronization and thread switching
strategies. The blocking model requires complex architectural support to
efficiently switch between ready threads. The frame space is deallocated only
when all the thread instances associated with its code block have terminated
execution, which is determined by extensive static program analysis.
The model also relies on static analysis to distribute the shared data
structures and therefore reduce the overhead of split-phase accesses by making
some data structure accesses local. The non-blocking model relies on a simple
scheduling mechanism: threads are enabled in a data-driven manner, by data
availability. Once a thread completes execution, its framelet is deallocated
and the space is reclaimed.

The main difference between the Frame and Framelet models of
synchronization is token duplication. The Framelet model requires
that variables which are shared by several threads within a code block be
replicated to all these threads, while in the Frame model these variables are
allocated only once in the frame. The advantage of the Framelet model is that
it is possible to design special storage schemes [31] that can take advantage
of inter-thread and intra-thread locality and achieve a cache miss rate
close to 1%.

6.3 Summary of Performance Results 

This section summarizes the results of an experimental evaluation of the two
execution models and their associated storage models. A preliminary version
of these results was reported in [4]; detailed results are reported in [5].

The evaluation of the program execution characteristics of these two models
shows that the blocking model has a significant reduction in threads,
instructions, and synchronization operations executed with respect to the
non-blocking model. It also has a larger average thread size (by 26% on
average) and, therefore, a lower number of synchronization operations per
instruction executed (17% lower on average).

However, the total number of accesses to the Frame storage, in the
blocking model, is comparable to the number of accesses to the Framelet
storage in the non-blocking model. Although the Frame storage model
eliminates the replication of data values, the synchronization mechanism
requires that two or more synchronization slots (counters) be accessed for
each shared data value. The number of synchronization accesses to the frames
nearly offsets all the redundant accesses. In fact the size of the trace of
accesses to the frames is less than 3% smaller than the framelet trace size.
Hence, synchronization overhead is the same for the frame and framelet models
of synchronization.

The evaluation also looked at the performance of a cache memory for the
Frame and Framelet models. Both models exhibit a large degree of spatial
locality in their accesses: in both cases the optimal cache block size was 256
bytes. However, the Framelet model has a much higher degree of temporal
locality, resulting in an average miss rate of 1.82% as opposed to 5.29% for
the Frame model (both caches being 16KB, 4-way set associative with 256
byte blocks).

The execution time of the blocking model is highly dependent on the
success rate of the static data distribution. The execution times for success
rates of 100% or 90% are comparable and outperform those of the non-blocking
model. For a success rate of 50%, however, the execution time may be higher
than that of the non-blocking model. The performance, however, depends
largely on the network latency. When the network latency is low and the
processor utilization high, the non-blocking model performs as well as the
blocking model with a 100% or 90% success rate.



7. Conclusions and Future Research 

The functional model of computation is one attempt at providing an
implicitly parallel programming paradigm. Because of the lack of state and
its functionality, it allows the compiler to extract all available parallelism,
fine and coarse grain, regular and irregular, and generate a partial
evaluation order of the program. In its pure form (e.g., pure Lisp, Sisal,
Haskell), this model is unable to express algorithms that rely explicitly on
state. However, extensions to these languages have been proposed to allow a
limited amount of stateful computations when needed. Instead, we are
investigating the feasibility of the declarative programming style, both in
terms of its expressibility and its run-time performance, over a wide range
of numerical and non-numerical problems and algorithms, and executing on both
conventional and novel parallel architectures. We are also evaluating the
ability of these languages to aid compiler analysis to disambiguate and
parallelize data structure accesses.

On the implementation side, we have demonstrated how multithreaded
implementations combine the strengths of both the von Neumann model (in its
exploitation of program and data locality) and the data-driven model (in
its ability to hide latency and support efficient synchronization). New
architectures such as TERA [2] and *T [26] are being built with hardware
support for multithreading. In addition, software multithreading models such
as TAM [13] and MIDC [8] are being investigated.

We are currently further investigating the performance of both
software-supported and hardware-supported multithreaded models on a wide
range of parallel machines. We have designed and evaluated low-level machine
independent optimization and code generation for multithreaded execution. The
target hardware platforms will be stock machines, such as single superscalar
processors, shared memory, and multithreaded machines. We will also target
more experimental dataflow machines (e.g., Monsoon [18,37]).



1. Introduction

The High Performance C++ consortium is a group that has been working
for the last two years on the design of a standard library for parallel
programming based on the C++ language. The consortium consists of people
from research groups within Universities, Industry and Government
Laboratories. The goal of this effort is to build a common foundation for
constructing portable parallel applications. The design has been partitioned
into two levels. Level 1 consists of a specification for a set of class
libraries and tools that do not require any extension to the C++ language.
Level 2 provides the basic language extensions and runtime library needed to
implement the full HPC++ Level 1 specification.

Our goal in this chapter is to briefly describe part of the Level 1
specification and then provide a detailed account of our implementation
strategy. Our approach is based on a library, HPC++Lib, which is described in
detail in this document. We note at the outset that HPC++Lib is not unique
and the key ideas are drawn from many sources. In particular, many of the
ideas originate with K. Mani Chandy and Carl Kesselman in the CC++
language [6, 15], the MPC++ Multiple Threads Template Library designed
by Yutaka Ishikawa of RWCP [10], the IBM ABC++ library [15,16], the Object
Management Group CORBA specification [9] and the Java concurrency
model [1].

In particular, Carl Kesselman at USC ISI is also building an implementation
of HPC++ using CC++ as the level 2 implementation layer. Our
implementation builds upon a compiler technology developed in collaboration
with ISI, but our implementation strategy is different.

The key features of HPC++Lib are

- A Java style thread class that provides an easy way to program parallel
  applications on shared memory architectures. This thread class is also used
  to implement the loop parallelization transformations that are part of the
  HPC++ level 1 specification.

- A template library to support synchronization, collective parallel
  operations such as reductions, and remote memory references.

- A CORBA style IDL-to-proxy generator, which is used to support member
  function calls on objects located in remote address spaces.

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 73-107, 2001.
Springer-Verlag Berlin Heidelberg 2001

This chapter introduces the details of this programming model from the
application programmer's perspective and describes the compiler support
required to implement and optimize HPC++.



2. The HPC++ Programming and Execution Model

The runtime environment for HPC++ can be described as follows. The basic
architecture consists of the following components.

- A node is a shared-memory multiprocessor (SMP), possibly connected to
  other SMPs via a network. Shared memory is a coherent shared-address
  space that can be read and modified by any processor in the node. A node
  could be a laptop computer or a 128-processor SGI Origin 2000.

- A context refers to a virtual address space on a node, usually accessible
  by several different threads of control. A Unix process often represents a
  context. We assume that there may be more than one context per node in
  a given computation.

- A set of interconnected nodes constitutes a system upon which an HPC++
  program may be run.

There are two conventional modes of executing an HPC++ program. The
first is "multi-threaded, shared memory", where the program runs within one
context. Parallelism comes from the parallel loops and the dynamic creation
of threads. Sets of threads and contexts can be bound into Groups, and there
are collective operations such as reductions and prefix operators that can be
applied to synchronize the threads of a group. This model of programming is
very well suited to modest levels of parallelism (about 32 processors) and
to cases where memory locality is not a serious factor.



[Figure: Node 1, Node 2 and Node 3 hosting four contexts; the surviving
labels show Context 3 with 3 threads and Context 4 with 1 thread.]

Fig. 2.1. A SPMD program on three nodes with four contexts. Each context
may have a variable number of threads.

The second mode of program execution is an explicit "Single Program
Multiple Data" (SPMD) model where n copies of the same program are run on
n different contexts. This programming model is similar to that of Split-C [7],
pC++ [15], AC [5] or C/C++ with MPI or PVM in that the distribution
of data that must be shared between contexts and the synchronization of
accesses to that data must be managed by the programmer. HPC++ differs
from these other C-based SPMD systems in that the computation on each
context can also be multi-threaded and the synchronization mechanisms for
thread groups extend to sets of thread groups running in multiple contexts.
It should also be noted that an SPMD computation need not be completely
homogeneous: a program may contain two contexts on one node and one
context on each of two other nodes. Furthermore, each of these contexts may
contain a variable number of threads (see Figure 2.1).

Multi-context SPMD programming with multi-threaded computation 
within each context supports a range of applications, such as adaptive grid 
methods for large scale simulation, that are best expressed using a form of 
multi-level parallelism. 

2.1 Level 1 HPC++

The level 1 library has three components. The first component is a set of
simple loop directives that control parallelism within a single context. The
compiler is free to ignore these directives, but if there is more than one
processor available, it can use the directives to parallelize simple loops.

The HPC++ loop directives are based on ideas from HPF [8] and other
older proposals. The idea is very simple. The HPC++ programmer can
identify a loop and annotate it with a #pragma to inform the compiler it is
"independent". This means that each iteration is independent of every other
iteration, and they are not ordered. Consequently, the compiler may choose
to execute the loop in parallel, and generate the needed synchronization for
the end of the loop. In addition, variables that do not carry loop dependences
can be labeled as PRIVATE so that one copy of the variable is generated for
each iteration. Furthermore, in the case of reductions, it is possible to
label a statement with the REDUCE directive so that the accumulation
operations will be atomic. As a simple example, consider the following
function, which multiplies an n by n matrix with a vector.

    void Matvec(double **A, int n, double *X, double *Y){
      double tmp;
      #pragma HPC_INDEPENDENT, PRIVATE tmp
      for(int i = 0; i < n; i++){
        tmp = 0;
        #pragma HPC_INDEPENDENT
        for(int j = 0; j < n; j++){
          #pragma HPC_REDUCE
          tmp += A[i][j]*X[j];
        }
        Y[i] = tmp;
      }
    }

This function may generate up to n^2 parallel threads because both loops
are labeled as HPC_INDEPENDENT. However, the compiler and the run-time
system must work together to choose when new threads of control will be
created and when loops will be performed sequentially. Also, each iterate of
the outer loop uses the variable tmp, labeled PRIVATE, to accumulate the inner
product. The atomicity of the reduction is guaranteed by the HPC_REDUCE
directive at the innermost level.

In section 5 below we will describe the program transformations that the 
compiler must undertake to recast the annotated loop above into a parallel 
form using the HPC++ Thread library. 
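
As a rough illustration of what such a parallel form has to achieve, the following sketch parallelizes only the outer loop with standard C++ threads. It does not use the HPC++ Thread library, whose interface is not shown here, and the function name matvec_threads and the use of std::vector are our own assumptions; each outer iteration gets a private accumulator and the joins play the role of the synchronization at the end of the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Outer loop run in parallel, one standard thread per iteration; the inner
// loop is kept sequential, so no atomic reduction is needed in this sketch.
void matvec_threads(const std::vector<std::vector<double>>& A,
                    const std::vector<double>& X,
                    std::vector<double>& Y) {
    const std::size_t n = A.size();
    assert(X.size() == n && Y.size() == n);
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        workers.emplace_back([&A, &X, &Y, i, n] {
            double tmp = 0;  // plays the role of the PRIVATE variable
            for (std::size_t j = 0; j < n; ++j)
                tmp += A[i][j] * X[j];
            Y[i] = tmp;      // each thread writes a distinct element
        });
    }
    // Synchronization at the end of the parallel loop.
    for (std::thread& t : workers) t.join();
}
```

Creating one thread per iteration is rarely profitable in practice, which is why the compiler and run-time system have to decide when loops are better executed sequentially.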

2.2 The Parallel Standard Template Library 

As described above, there are two execution models for HPC++ programs.
For the single context model, an HPC++ program is launched as an ordinary
C++ program with an initial single main thread of control. If the context is
running on a node with more than one processor, parallelism can be exploited
by using parallel loop directives, the HPC++ Parallel Standard Template
Library (PSTL), or by spawning new threads of control. For multiple context
execution, an HPC++ program launches one thread of control to execute the
program in each context. This Single Program Multiple Data (SPMD) mode
is a model of execution that is easily understood by programmers even though
it requires the user to reason about and debug computations where the data
structures are distributed over multiple address spaces. The HPC++ library
is designed to help simplify this process.

One of the major recent changes to the C++ standard has been the 
addition of the Standard Template Library (STL) [13, 14]. The STL has five 
basic components. 

Container class templates provide standard definitions for common aggregate 
data structures, including vector, list, deque, set and map. 

Iterators generalize the concept of a pointer. Each container class defines 
an iterator that gives us a way to step through the contents of containers 
of that type. 

Generic Algorithms are function templates that allow standard element-wise 
operations to be applied to containers. 

Function Objects are created by wrapping functions with classes that typically 
have only operator() defined. They are used by the generic algorithms 
in place of function pointers because they provide greater efficiency. 

Adaptors are used to modify STL containers, iterators, or function objects. 
For example, container adaptors are provided to create stacks and queues, 
and iterator adaptors are provided to create reverse iterators to traverse 
an iteration space backwards. 

The Parallel Standard Template Library (PSTL) is a parallel extension 
of STL. Distributed versions of the STL container classes are provided along 
with versions of the STL algorithms which have been modified to run in 
parallel. In addition, several new algorithms have been added to support 
standard parallel operations such as the element-wise application of a function 
and parallel reduction over container elements. Finally, parallel iterators have 
been provided. These iterators extend global pointers and are used to access 
remote elements in distributed containers. 

2.3 Parallel Iterators 

STL iterators are generalizations of C++ pointers that are used to traverse 
the contents of a container. HPC++ parallel iterators are generalizations of 
this concept to allow references to objects in different address spaces. 

In the case of random access parallel iterators, the operators ++, --, 
+n, -n, and [i] allow random access to the entire contents of a distributed 
container. In general, each distributed container class C will have a subclass 
for the strongest form of parallel iterator that it supports (e.g. random access, 
forward or bidirectional) and begin and end iterator functions. For example, 
each container class will provide functions of the form 

template <class T> 
class Container{ 
    class pariterator{ ... }; 
    pariterator parbegin(); 
    pariterator parend(); 
}; 

2.4 Parallel Algorithms 

In HPC++ PSTL there are two types of algorithms. First are the conventional 
STL algorithms like for_each(), which can be executed in parallel if called 
with parallel iterators. The second type includes STL algorithms where the 
semantics of the algorithm must be changed to make sense in a parallel 
context, as well as several new algorithms that are very common in parallel 
computation. Algorithms in this second group are identified by the prefix 
par_, and may be invoked with the standard random access iterators for single 
context parallelism or with parallel random access iterators for multi-context 
SPMD parallelism. 

The most important of the new parallel algorithms in HPC++ STL are 

par_apply(begin1, end1, begin2, begin3, f()) which applies a function 
object pointwise to the elements of a set of containers. 

par_reduction(begin1, end1, begin2, begin3, reduce(), f()) which is a 
parallel apply followed by a reduction on an associative binary operator. 

par_scan(result_begin, result_end, begin2, begin3, ..., scanop(), f()) which 
is a parallel apply followed by a parallel prefix computation. 
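
The meaning of "apply followed by a reduction" can be illustrated with a sequential reference version. The sketch below is an assumption-laden illustration in standard C++: the names apply_sketch, reduction_sketch and sum_of_squares are hypothetical, only a single input range is handled, and the argument order (range, reducing operation, element function, initial value) simply follows the way par_reduction is called in the spanning tree example. It says nothing about how the PSTL implements these algorithms in parallel.

```cpp
#include <functional>
#include <vector>

// Sequential reference for "apply": call f on every element of the range.
template <class Iter, class Func>
void apply_sketch(Iter begin, Iter end, Func f) {
    for (Iter it = begin; it != end; ++it) f(*it);
}

// Sequential reference for "apply then reduce": f is applied to each
// element and the results are combined with the binary operation reduce,
// starting from init. A parallel version may regroup the combinations,
// which is why reduce should be associative.
template <class Iter, class Reduce, class Func, class T>
T reduction_sketch(Iter begin, Iter end, Reduce reduce, Func f, T init) {
    T acc = init;
    for (Iter it = begin; it != end; ++it)
        acc = reduce(acc, f(*it));
    return acc;
}

// Example use: the sum of the squares of the elements of a vector.
int sum_of_squares(const std::vector<int> &v) {
    return reduction_sketch(v.begin(), v.end(), std::plus<int>(),
                            [](int x) { return x * x; }, 0);
}
```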

2.5 Distributed Containers 

The HPC++ container classes include versions of each of the STL containers 
prefixed by the phrase distributed_ to indicate that they operate in a 
distributed SPMD execution environment. Constructors for these containers 
are collective operations, i.e. they must be invoked in each executing context 
in parallel. For example, a distributed vector with elements of type T is 
constructed with 

distributed_vector<T> X(dim0, &distrib_object); 

The last parameter is a distribution object which defines the mapping of 
the array index space to the set of contexts active in the computations. If 
the distribution parameter is omitted, then a default block distribution is 
assumed. 

A more complete description of the Parallel STL is given in [11]. 



3. A Simple Example: The Spanning Tree of a Graph 

The minimum spanning tree algorithm [12] takes a graph with weighted 
connections and attempts to find a tree that contains every vertex of the graph 
so that the sum of connection weights in the tree is minimal. 

The graph is represented by the adjacency matrix W of dimensions n x n, 
where n is the number of vertices in the graph. W[i, j] contains the weight 
of the connection between vertex i and vertex j. W[i, j] is set to infinity if 
vertex i and vertex j are not connected. 

The algorithm starts with an arbitrary vertex of the graph and considers 
it to be the root of the tree being created. Then, the algorithm iterates 
n - 1 times, choosing one more vertex from the pool of unselected vertices 
during each iteration. The pool of unselected vertices is represented by the 
distance vector D. D[i] is the weight of the connection from an unselected 
vertex i to the closest selected vertex. During each iteration, the algorithm 
selects a vertex whose corresponding D value is the smallest among all the 
unselected vertices. It adds the selected vertex to the tree and updates the 
values in D for the rest of the unselected vertices in the following way. For each 
remaining vertex, it compares the corresponding D value with the weight of 
the connection between the newly selected vertex and the remaining vertex. 
If the weight of the new connection is less than the old D value, it is stored 
in D. 

After the n - 1 iterations D will contain the weights of the selected 
connections. We can parallelize this algorithm by searching for the minimum in 
D and updating D in parallel. 
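
Before turning to the distributed version, it may help to see the algorithm just described in sequential form. The following sketch is an illustration only: it uses a dense adjacency matrix held in a std::vector, the hypothetical function name spanning_tree_weight, vertex 0 as the root, and INT_MAX to stand for infinity, and it assumes the graph is connected. The two inner loops are the minimum search and the update of D, which are the steps the text proposes to run in parallel.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// W[i][j] is the weight of the connection between i and j, or INT_MAX if
// the two vertices are not connected. Returns the total weight of the
// spanning tree, assuming the graph is connected.
long spanning_tree_weight(const std::vector<std::vector<int>> &W) {
    const std::size_t n = W.size();
    if (n == 0) return 0;
    std::vector<int> D(n, INT_MAX);        // distance to the closest selected vertex
    std::vector<bool> selected(n, false);
    selected[0] = true;                    // vertex 0 is the root
    for (std::size_t j = 1; j < n; ++j) D[j] = W[0][j];

    long total = 0;
    for (std::size_t step = 1; step < n; ++step) {   // n - 1 iterations
        // Search for the unselected vertex with the smallest D value.
        std::size_t u = n;                            // n means "none found yet"
        for (std::size_t j = 0; j < n; ++j)
            if (!selected[j] && (u == n || D[j] < D[u])) u = j;
        selected[u] = true;
        total += D[u];
        // Update D for the remaining unselected vertices.
        for (std::size_t j = 0; j < n; ++j)
            if (!selected[j] && W[u][j] < D[j]) D[j] = W[u][j];
    }
    return total;
}
```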

To conserve memory, we decided to deal with sparse graphs and impose a 
limit on the number of edges for any one vertex. We represent the adjacency 
matrix W by a distributed vector of edge_list objects, each a list of pairs. 
Each edge_list describes all the edges for one vertex; each pair represents one 
weighted edge where the first element is the weight of the edge and the second 
element is the index of the destination vertex. 

struct weighted_edge{ 
    int weight; 
    int vertex; 
}; 

struct edge_list { 
    typedef weighted_edge* iterator; 
    weighted_edge my_edges[MAX_EDGES]; 
    int num_edges; 

    iterator begin() { return my_edges; } 
    iterator end() { return my_edges+num_edges; } 
}; 

typedef distributed_vector<edge_list> Graph; 

Graph W(n); 

We represent the distance vector D by a distributed vector of pairs. The first 
element of each pair is the D value, i.e. the weight of the connection from the 
corresponding unselected vertex to the closest selected vertex. The second 
element of each pair is used as a flag of whether the corresponding vertex has 
already been selected to the tree. It is set to the pair's index in D until the 
vertex is selected and is assigned -1 after the vertex is selected. 

struct cost_to_edge{ 
    int weight; 
    long index; 
    cost_to_edge(int _weight, long _to_vertex); 
}; 

typedef distributed_vector<cost_to_edge> DistanceVector; 

DistanceVector D(n); 

The main part of the program is a for loop that repeatedly finds a minimum in 
the distance vector D using par_reduction, marks the found vertex as selected, 
and updates the distance vector using par_apply. The call to par_reduction 
uses the identity function-class as the operation to apply to each element of 
the distributed vector (it simply returns its argument) and a min function-class 
as the reducing operation (it compares two pairs and returns the one 
with smaller weight). Min also requires an initial value for the reduction, in 
this case a cost_to_edge pair with weight INT_MAX. 

for(long i=1; i<n; i++){ 
    cost_to_edge u = par_reduction(D.parbegin(), 
                                   D.parend(), 
                                   min(), 
                                   identity<cost_to_edge>(), 
                                   cost_to_edge(INT_MAX, -1)); 

    D[u.index] = cost_to_edge(u.weight, -1); 

    par_apply(D.parbegin(), D.parend(), update(u.index, W)); 
} 

The second statement in the loop body marks the found vertex as selected. 
The last statement updates D using the update function-class. Update defines 
an operator() that takes a reference to an element in D and replaces that 
element with a new pair if it finds a lower weight edge in the graph to the 
element of D. Since the update function-object needs to refer to the adjacency 
matrix and the index of the newly selected vertex, we have to store references 
to them in instance variables of the function object. We do that by passing 
u and W to the update constructor. 

struct update { 
    long u; 
    Graph &w; 

    update(long u1, Graph &g) : u(u1), w(g) {}; 
    void operator()(cost_to_edge &v) { 
        if (v.index >= 0) { 
            Graph::pariterator w_iter = w.parbegin(); 
            edge_list wi = w_iter[v.index]; 
            // find an edge from v.index that goes to u 
            edge_list::iterator temp = 
                find_if(wi.begin(), wi.end(), FindVert(u)); 
            int weight_uv = (temp == wi.end()) ? 
                INT_MAX : (*temp).weight; 
            if (v.weight > weight_uv) 
                v = cost_to_edge(weight_uv, v.index); 
        } 
    } 
}; 

  P      graph build time   spanning tree time   speed-up 
  ...    ...                5.84                 3.93 
  6      0.371              5.40                 4.25 
  ...    0.402              5.03                 4.59 
  8      0.470              4.94                 4.64 

Table 3.1. Spanning Tree Performance Results 



The basic parallel STL has been prototyped on the SGI Power Challenge 
and the IBM SP2. In Table 3.1 we show the execution time for the spanning 
tree computation on a graph with 1000 vertices on the SGI. The computation 
is dominated by the reduction. This operation will have a speed-up that grows 
as $\frac{P}{1 + C \log(P)/N}$, where P is the number of processors, N is the 
problem size, and C is the ratio of the cost of a memory reference in a remote 
context to that of a local memory reference. In our case C is approximately 200. 
The resulting speed-up for this size problem with 8 processors is 5, so our 
computation is performing about as well as we might expect. We have also 
included the time to build the graph. It should be noted that the time to build 
the graph grows with the number of processors. We have not yet attempted 
to optimize the parallelization of this part of the program. 
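
The quoted speed-up of 5 can be checked by direct substitution. The calculation below is a sketch under two assumptions that are readings of the text rather than statements taken from it: that the estimate has the form P/(1 + C log(P)/N), and that the logarithm is taken base 2. With P = 8, C = 200 and N = 1000:

```latex
\[
\frac{P}{1 + C\,\log_2(P)/N}
  = \frac{8}{1 + 200 \cdot 3 / 1000}
  = \frac{8}{1.6}
  = 5 .
\]
```

Under these assumptions the estimate reproduces the figure of 5 given in the text, which is close to the measured speed-up of 4.64 on 8 processors.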

4. Multi-threaded Programming 

The implementation of HPC++ described here uses a model of threads that is 
based on a Thread class which is, by design, similar to the Java thread system. 
More specifically, there are two basic classes that are used to instantiate 
a thread and get it to do something. Basic Thread objects encapsulate a 
thread and provide a private data space. Objects of class Runnable provide 
a convenient way for a set of threads to execute the member functions of a 
shared object. 

The interface for Thread is given by 

class HPCxx_Thread{ 
public: 
    HPCxx_Thread(const char *name = NULL); 
    HPCxx_Thread(HPCxx_Runnable *runnable, 
                 const char *name = NULL); 
    virtual ~HPCxx_Thread(); 
    HPCxx_Thread& operator=(const HPCxx_Thread& thread); 
    virtual void run(); 
    static void stop(void *status); 
    static void yield(); 
    void resume(); 
    int isAlive(); 
    static HPCxx_Thread *currentThread(); 
    void join(long milliseconds = 0, 
              long nanoseconds = 0); 
    void setName(const char *name); 
    const char *getName(); 
    int getPriority(); 
    int setPriority(int priority); 
    static void sleep(long milliseconds, 
                      long nanoseconds = 0); 
    void suspend(); 
    void start(); 
}; 

The interface for Runnable is given by 

class HPCxx_Runnable{ 
public: 
    virtual void run() = 0; 
}; 

There are two ways to create a thread and give it work to do. The first 
is to create a subclass of Runnable which provides an instance of the run() 
method. For example, to make a class that prints a message we can write 

class MyRunnable: public HPCxx_Runnable{ 
    const char *x; 
public: 
    MyRunnable(const char *c) : x(c){} 
    void run(){ 
        printf("%s", x); 
    } 
}; 

The program below will create two threads that each run the 
run() method for a single instance of a runnable object. 

MyRunnable r("hello world"); 
HPCxx_Thread *t1 = new HPCxx_Thread(&r); 
HPCxx_Thread *t2 = new HPCxx_Thread(&r); 
t1->start(); // launch the thread but don't block 
t2->start(); 

This program prints 

hello worldhello world 

It is not required that a thread have an object of class Runnable to execute. 
One may subclass Thread to provide a private data and name space for a 
thread and override the run() function there as shown below. 

class MyThread: public HPCxx_Thread{ 
    const char *x; 
public: 
    MyThread(const char *y) : HPCxx_Thread(), x(y) {} 
    void run(){ 
        printf("%s", x); 
    } 
}; 

int main(int argc, char *argv[]){ 
    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 
    MyThread *t1 = new MyThread("hello"); 
    t1->start(); 
    return hpcxx_exit(g); 
} 

The decision for when to subclass Thread or Runnable depends upon the 
application. As we shall see in the section on implementing HPC++ parallel 
loops, there are times when both approaches are used together. 

The initialization function hpcxx_init() strips all command line flags of 
the form -hpcxx_ from the argv and argc array so that application flags are passed 
to the program in normal order. This call also initializes the object g of type 
HPCxx_Group, which is used for synchronization purposes and is described 
in greater detail below. The termination function hpcxx_exit() is a clean up 
and termination routine. 

It should be noted that in this small example, it is possible for the main 
program to terminate prior to the completion of the two threads. This would 
signal an error condition. We will discuss the ways to prevent this from hap- 
pening in the section on synchronization below. 



4.1 Synchronization 



There are two types of synchronization mechanisms used in this HPC++ 
implementation: collective operator objects and primitive synchronization 
objects. The collective operations are based on the HPCxx_Group class, which 
plays a role in HPC++ that is similar to that of the communicator in MPI. 

4.1.1 Primitive Sync Objects. There are four basic synchronization 
classes in the library: 

An HPCxx_Sync<T> object is a variable that can be written to once 
and read as many times as you want. However, if a read is attempted prior 
to a write, the reading thread will be blocked. Many readers can be waiting 
for a single HPCxx_Sync<T> object, and when a value is written to it 
all the readers are released. Readers that come after the initial write see this 
as a const value. CC++ provides this capability as the sync modifier. 

The standard methods for HPCxx_Sync<T> are 

template<class T> 
class HPCxx_Sync{ 
public: 
    operator T();        // read a value 
    operator =(T &);     // assign a value 
    void read(T &);      // another form of read 
    void write(T &);     // another form of writing 
    bool peek(T &);      // TRUE if the value is there, 
                         // returns FALSE otherwise. 
}; 
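
The write-once, block-until-written behaviour can be imitated with a mutex and a condition variable. The class below is a sketch in standard C++ and is not the HPC++Lib implementation; the name SyncSketch, the choice to silently ignore writes after the first one, and the helper sync_demo are all assumptions made for this illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical analogue of a write-once synchronization variable.
template <class T>
class SyncSketch {
    std::mutex m;
    std::condition_variable cv;
    bool written = false;
    T value{};
public:
    // Only the first write takes effect (an assumption of this sketch);
    // it releases every reader that is currently waiting.
    void write(const T &v) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (written) return;
            value = v;
            written = true;
        }
        cv.notify_all();
    }
    // Blocks until a value has been written, then returns a copy of it.
    T read() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return written; });
        return value;
    }
    // Non-blocking test: copies the value into out if it is there.
    bool peek(T &out) {
        std::lock_guard<std::mutex> lock(m);
        if (written) out = value;
        return written;
    }
};

// Example: the calling thread blocks in read() until the worker writes.
int sync_demo() {
    SyncSketch<int> s;
    std::thread producer([&s] { s.write(42); });
    int v = s.read();
    producer.join();
    return v;
}
```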



HPCxx_SyncQ<T> provides a dual 'queue' of values of type T. Any 
attempt to read a sync variable before it is written will cause the reading 
thread to suspend until a value has been assigned. The waiting thread will 
'take the value' from the queue and continue executing. Waiting threads are 
also queued. The i-th thread in the queue will receive the i-th value written to 
the sync variable. 

There are several other standard methods for HPCxx_SyncQ<T>. 

template<class T> 
class HPCxx_SyncQ{ 
public: 
    operator T();            // read a value 
    operator =(T &);         // assign a value 
    void read(T &);          // another form of read 
    void write(T &);         // another form of writing 
    int length();            // the number of values in the queue 

    // wait until the value is there and then 
    // read the value but do not remove it from 
    // the queue. The next waiting thread is signaled. 
    void waitAndCopy(T& data); 

    bool peek(T &);          // same as Sync 
}; 

For example, threads that synchronize around a producer-consumer interaction 
can be easily built with this mechanism. 

class Producer: public HPCxx_Thread{ 
    HPCxx_SyncQ<int> &x; 
public: 
    Producer(HPCxx_SyncQ<int> &y) : x(y){} 
    void run(){ 
        printf("hi there\n"); 
        x = 1; // produce a value for x 
    } 
}; 

int main(int argc, char *argv[]){ 
    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 
    HPCxx_SyncQ<int> a; 
    Producer *t = new Producer(a); 
    printf("start then wait for a value to be assigned\n"); 
    t->start(); 
    int x = a; // consume a value here. 
    hpcxx_exit(g); 
    return x; 
} 
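
For readers without access to HPC++Lib, the same producer-consumer pattern can be tried with a small blocking queue built from standard C++ primitives. This is a sketch only: the names SyncQueueSketch and producer_consumer_demo are hypothetical, and unlike the queue described above it makes no promise about the order in which several waiting readers are served.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical analogue of a queue of values with blocking reads.
template <class T>
class SyncQueueSketch {
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
public:
    // Append a value and wake one waiting reader, if any.
    void write(const T &v) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(v);
        }
        cv.notify_one();
    }
    // Block until a value is available, then remove and return it.
    T read() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        T v = q.front();
        q.pop();
        return v;
    }
    // Number of values currently in the queue.
    int length() {
        std::lock_guard<std::mutex> lock(m);
        return static_cast<int>(q.size());
    }
};

// Example: one producer thread writes 1, 2, 3; the caller consumes them.
int producer_consumer_demo() {
    SyncQueueSketch<int> a;
    std::thread producer([&a] {
        for (int i = 1; i <= 3; ++i) a.write(i);
    });
    int sum = 0;
    for (int i = 0; i < 3; ++i) sum += a.read();
    producer.join();
    return sum;
}
```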

Counting semaphores. HPCxx_CSem objects provide a way to wait for a group of 
threads to synchronize termination of a number of tasks. When constructed, 
a limit value is supplied and a counter is set to zero. A thread executing 
waitAndReset() will suspend until the counter reaches the 'limit' value. The 
counter is then reset to zero. The overloaded '++' operator increments the 
counter by one. 

class HPCxx_CSem{ 
public: 
    HPCxx_CSem(int limit); 

    // prefix and postfix ++ operators. 
    HPCxx_CSem& operator++(); 
    const HPCxx_CSem& operator++(); 
    HPCxx_CSem& operator++(int); 
    const HPCxx_CSem& operator++(int); 

    void waitAndReset(); // wait until the count reaches the limit 
                         // then reset the counter to 0 and exit. 
}; 
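
The counting behaviour described above can be sketched with a mutex and a condition variable. The class CountingSemSketch and the helper join_demo below are hypothetical names used for illustration in standard C++; an explicit increment() method stands in for the overloaded ++ operator, and nothing here is taken from the HPC++Lib source.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical analogue of a counting semaphore with a limit.
class CountingSemSketch {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
    const int limit;
public:
    explicit CountingSemSketch(int limit_) : limit(limit_) {}
    // Plays the role of the ++ operator: add one to the counter.
    void increment() {
        {
            std::lock_guard<std::mutex> lock(m);
            ++count;
        }
        cv.notify_all();
    }
    // Suspend until the counter reaches the limit, then reset it to zero.
    void waitAndReset() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count >= limit; });
        count = 0;
    }
};

// Example: wait for num_workers threads to finish their (trivial) work.
int join_demo(int num_workers) {
    CountingSemSketch cs(num_workers);
    std::vector<int> results(num_workers, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i)
        workers.emplace_back([&cs, &results, i] {
            results[i] = i + 1;   // the "work"
            cs.increment();       // signal completion
        });
    cs.waitAndReset();            // every worker has incremented by now
    int sum = 0;
    for (int r : results) sum += r;
    for (auto &w : workers) w.join();  // std::thread objects must still be joined
    return sum;
}
```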

By passing a reference to a HPCxx_CSem to a group of threads, each of which 
does a '++' prior to exit, you can build a multi-threaded 'join' operation. 

class Worker: public HPCxx_Thread{ 
    HPCxx_CSem &c; 
public: 
    Worker(HPCxx_CSem &c_) : c(c_){} 
    void run(){ 
        // work 
        c++; 
    } 
}; 

int main(int argc, char *argv[]){ 
    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 
    HPCxx_CSem cs(NUMWORKERS); 
    for(int i = 0; i < NUMWORKERS; i++){ 
        Worker *w = new Worker(cs); 
        w->start(); 
    } 
    cs.waitAndReset(); // wait here for all workers to finish. 
    hpcxx_exit(g); 
    return 0; 
} 



Mutex locks. Unlike Java, the library cannot support synchronized methods 
or CC++ atomic members, but a simple Mutex object with two functions, 
lock and unlock, provides the basic capability. 

class HPCxx_Mutex{ 
public: 
    void lock(); 
    void unlock(); 
}; 

To provide a synchronized method that only allows one thread at a time 
execution authority, one can introduce a private mutex variable and protect 
the critical section with locks as follows. 

class Myclass: public HPCxx_Runnable{ 
    HPCxx_Mutex l; 
public: 
    void synchronized(){ 
        l.lock(); 
        // critical section 
        l.unlock(); 
    } 
}; 

4.1.2 Collective Operations. Recall that an HPC++ computation consists 
of a set of nodes, each of which contains one or more contexts. Each 
context runs one or more threads. 

To access the node and context structure of a computation, the HPC++Lib 
initialization creates an object called a group. The HPCxx_Group class has 
the following public interface. 

class HPCxx_Group{ 
public: 
    // Create a new group for the current context. 
    HPCxx_Group(hpcxx_id_t &groupID = HPCXX_GEN_LOCAL_GROUPID, 
                const char *name = NULL); 

    // Create a group whose membership is this context 
    // and those in the list 
    HPCxx_Group(const HPCxx_ContextID *&id, 
                int count, 
                hpcxx_id_t &groupID = HPCXX_GEN_LOCAL_GROUPID, 
                const char *name = NULL); 

    ~HPCxx_Group(); 
    hpcxx_id_t &getGroupID(); 

    static HPCxx_Group *getGroup(hpcxx_id_t groupID); 

    // Get the number of contexts that are participating 
    // in this group 
    int getNumContexts(); 

    // Return an ordered array of context IDs in 
    // this group. This array is identical for every member 
    // of the group. 
    HPCxx_ContextID *getContextIDs(); 

    // Return the context id for zero-based context <n> where 
    // <n> is less than the current number of contexts 
    HPCxx_ContextID getContextID(int context); 

    // Set the number of threads for this group in *this* 
    // context. 
    void setNumThreads(int count); 
    int getNumThreads(); 
    void setName(const char *name); 
    const char *getName(); 
}; 

As shown below, a node contains all of the contexts running on the ma- 
chine, and the mechanisms to create new ones. 

class HPCxx_Node{ 
public: 
    HPCxx_Node(const char *name = NULL); 
    HPCxx_Node(const HPCxx_Node &node); 
    ~HPCxx_Node(); 

    bool contextIsLocal(const HPCxx_ContextID &id); 
    int getNumContexts(); 

    // Get an array of global pointers to the contexts 
    // on this node. 
    HPCxx_GlobalPtr<HPCxx_Context> *getContexts(); 

    // Create a new context and add it to this node 
    int addContext(); 

    // Create a new context and run the specified executable 
    int addContext(const char *execPath, char **argv); 
    void setName(const char *name); 
    const char *getName(); 
}; 

A context keeps track of the threads, and its ContextID provides a handle 
that can be passed to other contexts. 

class HPCxx_Context{ 
public: 
    HPCxx_Context(const char *name = NULL); 
    ~HPCxx_Context(); 

    HPCxx_ContextID getContextID(); 
    bool isMasterContext(); 

    // Return the current number of threads in this context 
    int getNumThreads(); 

    // Null terminated list of the current threads in this node 
    hpcxx_id_t *getThreadIDs(); 

    // Return the number of groups of which this context is 
    // a member. 
    int getNumGroups(); 

    // Return a list of the groups of which this context is 
    // a member. 
    hpcxx_id_t *getGroupIDs(); 
    void setName(const char *name); 
    const char *getName(); 
}; 

A group object represents a set of nodes and contexts and is the basis for 
collective operations. 

Groups are used to identify sets of threads and sets of contexts that par- 
ticipate in collective operations like barriers. In this section we only describe 
how a set of threads on a single context can use collective operations. Multi- 
context operations will be described in greater detail in the multi-context 
programming sections below. 

The basic operation is barrier synchronization. This is accomplished in 
the following steps. 

We first allocate an object of type HPCxx_Group and set the number of 
threads to the maximum number that will participate in the operation. For 
example, to set the thread count on the main group to be 13 we can write 
the following. 

int main(int argc, char *argv[]){ 
    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 
    g->setNumThreads(13); 
    HPCxx_Barrier barrier(*g); 

As shown above, a HPCxx_Barrier object must be allocated for the group. 
This can be accomplished in three ways: 

Use the group created in the initialization hpcxx_init(). This is the standard 
way SPMD computations do collective operations and it is described in 
greater detail below. 

Allocate the group with the constructor that takes an array of context IDs 
as an argument. This provides a limited form of 'subset' SIMD parallelism 
and will also be described in greater detail later. 

Allocate a group object with the void constructor. This group will refer to 
this context only and will only synchronize threads on this context. 

The constructor for the barrier takes a reference to the Group object. 

Each thread that will participate in the barrier operation must then acquire 
a key from the barrier object with the getKey() function. Once the 
required number of threads have a key to enter the barrier, the barrier can 
be invoked by means of the overloaded () operator as shown in the example 
below. 

class Worker: public HPCxx_Thread{ 
    int my_key; 
    HPCxx_Barrier &barrier; 
public: 
    Worker(HPCxx_Barrier &b) : barrier(b){ 
        my_key = barrier.getKey(); 
    } 
    void run(){ 
        while( notdone ){ 
            // work 
            barrier(my_key); 
        } 
    } 
}; 

int main(int argc, char *argv[]){ 
    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 
    g->setNumThreads(13); 
    HPCxx_Barrier barrier(*g); 
    for(int i = 0; i < 13; i++){ 
        Worker *w = new Worker(barrier); 
        w->start(); 
    } 
    hpcxx_exit(g); 
} 

A thread can participate in more than one barrier group and a barrier can 
be deallocated. The thread count of a Group may be changed, a new barrier 
may be allocated, and threads can request new keys. 
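
The essential behaviour of a reusable barrier can be sketched in standard C++ as follows. This is an illustration only, not the HPC++Lib implementation: the names BarrierSketch and barrier_demo are hypothetical, the number of participants is fixed at construction, and the key mechanism described above is omitted.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier for a fixed number of threads.
class BarrierSketch {
    std::mutex m;
    std::condition_variable cv;
    const int parties;
    int waiting = 0;
    unsigned long generation = 0;   // distinguishes successive barrier episodes
public:
    explicit BarrierSketch(int n) : parties(n) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        const unsigned long gen = generation;
        if (++waiting == parties) {
            // Last thread to arrive: start a new episode and release the rest.
            waiting = 0;
            ++generation;
            cv.notify_all();
        } else {
            cv.wait(lock, [this, gen] { return gen != generation; });
        }
    }
};

// Example: every thread writes a[i], waits, then reads a neighbour's value.
int barrier_demo(int num_threads) {
    BarrierSketch barrier(num_threads);
    std::vector<int> a(num_threads, 0), b(num_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back([&, i] {
            a[i] = i + 1;                       // phase 1
            barrier.wait();                     // all writes to a are done
            b[i] = a[(i + 1) % num_threads];    // phase 2
        });
    for (auto &t : threads) t.join();
    int sum = 0;
    for (int x : b) sum += x;
    return sum;
}
```

Without the barrier, a thread could read its neighbour's entry of a before it had been written.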

Reductions. Other collective operations exist and they are subclasses of the 
HPCxx_Barrier class. For example, let intAdd be the class 

class intAdd{ 
public: 
    int & operator()(int &x, int &y) { x += y; return x; } 
}; 

To create an object that can be used to form the sum-reduction of one integer 
from each thread, the declaration takes the form 

HPCxx_Reduct1<int, intAdd> r(group); 

and it can be used in the threads as follows: 

class Worker: public HPCxx_Thread{ 
    int my_key; 
    HPCxx_Reduct1<int, intAdd> &add; 
public: 
    Worker(HPCxx_Reduct1<int, intAdd> &a) : add(a){ 
        my_key = add.getKey(); 
    } 
    void run(){ 
        int x = 3.14*my_id; 
        // now compute the sum of all x values 
        int t = add(my_key, x, intAdd()); 
    } 
}; 

The public definition of the reduction class is given by 

template <class T, class Oper> 
class HPCxx_Reduct1 : public HPCxx_Barrier{ 
public: 
    HPCxx_Reduct1(HPCxx_Group &); 
    T operator()(int key, T &x, Oper op); 
    T* destructive(int key, T *buffer, Oper op); 
}; 

The operation can be invoked with the overloaded () operation as in the 
example above, or with the destructive() form, which requires a user supplied 
buffer to hold the arguments and returns a pointer to the buffer that holds 
the result. To avoid making copies, all of the buffers are modified in the 
computation. This operation is designed to be as efficient as possible, so it is 
implemented as a tree reduction. Hence the binary operator is required to be 
associative, i.e. 

op(x, op(y, z)) == op( op(x, y) , z) 
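
Why a tree reduction needs associativity can be seen from a small sequential model of the combining order. The function tree_reduce below is a hypothetical sketch in standard C++, not the HPC++Lib reduction: it combines neighbouring values pairwise, then pairs of pairs, and so on, which is the grouping a tree of threads would produce. It assumes a non-empty input.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Combine values pairwise in a binary tree: first neighbours, then pairs
// of pairs, and so on. The left-to-right order of the operands is kept,
// but the grouping differs from a plain left-to-right loop.
// Precondition: values is not empty.
template <class T, class Op>
T tree_reduce(std::vector<T> values, Op op) {
    for (std::size_t stride = 1; stride < values.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < values.size(); i += 2 * stride)
            values[i] = op(values[i], values[i + stride]);
    return values[0];
}
```

For an associative operator such as addition the grouping does not matter, so the values 1 to 5 reduce to 15 either way. For subtraction, which is not associative, the tree computes (8 - 4) - (2 - 1) = 3 for the values 8, 4, 2, 1, whereas a left-to-right loop computes ((8 - 4) - 2) - 1 = 1.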

The destructive form is much faster if the size of the data type T is large. A 
multi-argument form of this reduction will allow operations of the form 

$$\mathrm{sum} = \mathop{\mathrm{Op2}}_{i=0}^{n} \; \mathrm{Op1}(x1_i, x2_i, \ldots, xK_i)$$ 

and it is declared by the template 

template <class R, class T1, class T2, ... TK, 
          class Op1, class Op2> 
class HPCxx_ReductK{ 
public: 
    HPCxx_ReductK(HPCxx_Group &); 
    R & operator()(int key, T1, T2, ..., TK, Op2, Op1); 
}; 

where K is 2, 3, 4 or 5 in the current implementation, Op1 returns a value 
of type R, and Op2 is an associative binary operator on type R. 

Broadcasts. A synchronized broadcast of a value between a set of threads is 
accomplished with the operation 

template <class T> 
class HPCxx_Bcast{ 
public: 
    HPCxx_Bcast(HPCxx_Group &); 
    T operator()(int key, T *x); 
}; 

In this case, only one thread supplies a non-null pointer to the value and all 
the others receive a copy of that value. 

Multicasts. A value in each thread can be concatenated into a vector of values 
by the synchronized multicast operation. 

template <class T> 
class HPCxx_Mcast{ 
public: 
    HPCxx_Mcast(HPCxx_Group &); 
    T * operator()(int key, T &x); 
}; 

In this case, the operator allocates an array of the appropriate size and copies 
the argument values into the array in 'key' order. 

4.2 Examples of Multi-threaded Computations 

4.2.1 The NAS EP Benchmark. The NAS Embarrassingly Parallel benchmark 
illustrates a common approach to parallelizing loops using the thread 
library. The computation consists of computing a large number of Gaussian 
pairs and gathering statistics about the results (see [2] for more details). The 
critical component of the computation is a loop of the form: 

double q[nq], gc; 
for(k = 1; k <= nn; k++) 
    compute_pairs(k); 

The compute_pairs function calculates the pairs associated with the parameter 
k and updates the array q and the scalar gc by adding in new values 
computed for that value of k. There are no other side-effects of calling 
compute_pairs, so parallelization is very easy. 

Our approach is to partition the loop and encapsulate the computation 
into a set of independent objects which each compute a subset of the 
iterations. To accomplish this we duplicate the array q and the scalar gc as 
members of a HPCxx_Runnable class Gaussian shown below. In addition, 
each Gaussian object is given a unique identifier which is used to identify the 
subset of the iteration space that the object is responsible for computing. 

The main program creates THREAD_NUM instances of the Gaussian 
class. To signal the termination of the computation, the main program also 
allocates a HPCxx_CSem object initialized to count to THREAD_NUM and 
each Gaussian object is given a pointer to this object. When the threads 
executing the Gaussian objects complete their task they each increment this 
counter. The main thread waits until the total count reaches THREAD_NUM 
and then calculates the sum of the q and gc values. 

class Gaussian : public HPCxx_Runnable{ 
    int myId; 
    HPCxx_CSem *cs; 
public: 
    double q[nq]; 
    double gc; 

    void init(int id, HPCxx_CSem *cs_){ 
        myId=id; cs=cs_; .... } 
    void compute_pairs(int kk); 
    void run(void){ 
        for(int k = myId+1; k <= nn; k = k+THREAD_NUM){ 
            compute_pairs(k); 
        } 
        (*cs)++; 
    } 
}; 



int main(int argc, char *argv[]){ 
    double gc; 
    int i, k; 

    HPCxx_Group *g; 
    hpcxx_init(&argc, &argv, g); 

    Gaussian *G = new Gaussian[THREAD_NUM]; 
    HPCxx_CSem cs(THREAD_NUM); 

    for(i = 0; i < THREAD_NUM; i++) G[i].init(i, &cs); 
    for(k = 0; k < THREAD_NUM; k++){ 
        HPCxx_Thread *th = new HPCxx_Thread(&G[k]); 
        th->start(); 
    } 

    cs.waitAndReset(); 
    for(i = 0; i < nq; i++) q[i] = 0.0; 
    gc = 0.0; 

    // combine the per-object results, one Gaussian object per thread 
    for(k = 0; k < THREAD_NUM; k++){ 
        for(i = 0; i < nq; i++) 
            q[i] += G[k].q[i]; 
        gc += G[k].gc;   // add each object's gc exactly once 
    } 

    return hpcxx_exit(g); 
} 
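
The partitioning scheme used here, cyclic assignment of iterations with one private accumulator per thread, can be tried out in standard C++ without HPC++Lib. The sketch below is an illustration under stated assumptions: work(k) is a hypothetical stand-in for compute_pairs(k), the function name cyclic_sum is invented, num_threads is assumed to be at least 1, and std::thread::join replaces the counting semaphore.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-iteration work standing in for compute_pairs(k).
static double work(int k) { return static_cast<double>(k); }

// Sum work(k) for k = 1..nn, with iterations dealt out cyclically:
// thread id handles k = id+1, id+1+T, id+1+2T, ... (T = num_threads).
double cyclic_sum(int nn, int num_threads) {
    std::vector<double> partial(num_threads, 0.0);  // one private slot per thread
    std::vector<std::thread> workers;
    for (int id = 0; id < num_threads; ++id)
        workers.emplace_back([&partial, id, nn, num_threads] {
            for (int k = id + 1; k <= nn; k += num_threads)
                partial[id] += work(k);
        });
    for (auto &w : workers) w.join();   // wait for every thread to finish
    double total = 0.0;                 // sequential combination at the end
    for (double p : partial) total += p;
    return total;
}
```

Because each thread writes only its own slot, no locking is needed during the loop; the only synchronization is the final wait before the partial results are combined.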

This program was run on a 10 processor SGI Power Challenge with the 
value of THREAD_NUM ranging between 1 and 4096. As shown in Table 4.1, 
the performance is predictable and linear through 8 threads. Beyond 
that point the behavior is very irregular. Because the threads are scheduled 
dynamically and must compete for system resources, the performance 
for different runs can vary by as much as 50%. (This is common on 
the SGI SMP systems.) In the table we list the maximum and minimum 
recorded execution times and the speed-up associated with the best time. 
The fact that speed-up values can exceed the number of processors available 
(10) is probably due to fortunate scheduling behavior rather than locality. 

The important thing to notice about the behavior is that the sequential 
initialization of threads in the main program does not harm the performance 
even where the number of threads generated per processor is over 200. 

The threads in this implementation are based on the SGI Pthreads library. 
The limit to the number of threads that can be generated was reached when 
we tried to use 4096. Most attempts to run the program with this number of 
threads resulted in runtime errors when requesting threads. 

4.2.2 A Parallel Quicksort. To illustrate another example of using threads
and synchronization in parallel programming, consider the problem of
parallelizing a recursive computation. Quicksort is a standard fast sequential
sorting algorithm, but it is not considered to be well suited for
parallelization. For a problem of size n, the average execution time is
O(n log(n)). If we parallelize this with p processors, the speed-up is bounded by
With n = and p = 10, this bound is 6.6. However, this bound can only be
reached if the quicksort algorithm is lucky and selects the perfect partition
value for the array at each step.

threads    max time    min time    max speed-up
      1       218.3       218.3             1.0
      2       116.5       116.1             1.9
      4        59.7        59.0             3.7
      8        28.3        28.1             7.8
     16        28.1        25.2             8.0
     32        31.6        24.7             8.8
     64        29.4        20.7            10.5
    128        22.5        19.6            11.1
    256        30.5        20.4            10.7
    512        29.4        20.3            10.8
   1024        30.8        19.5            11.2
   2048        29.7        19.4            11.2
   4096           -        33.8               -

Table 4.1. NAS Embar Benchmark using Threads

Our implementation of the parallel quicksort is shown below. The algorithm
is conventional except at the point of the first recursive call where a
new thread is generated if the size of the array is not too small or the call
tree is not too deep. (We pass an additional parameter to keep track of the
depth of the call tree.) Clearly there is no reason to spawn a thread for very
small arrays; the cost of a small sequential sort may be less than the cost of
spawning a new thread. The depth heuristic is also used to limit the number
of threads in cases where the algorithm makes a series of bad choices for the
partition value.

An HPCxx_Sync<int> variable is used to synchronize the spawned
thread with the calling thread.

template <class T>
void pqsort(T *x, int low, int high, int depth){
    HPCxx_Sync<int> s;
    int i = low;
    int j = high;
    int k = 0;
    int checkflag = 0;
    if (i >= j) return;

    T m = x[i];
    while (i < j){
        while ((x[j] >= m) && (j > low)) j--;
        while ((x[i] < m) && (i < high)) i++;
        if (i < j){ swap(x[i], x[j]); }
    }

    if (j < high){
        if ((j-low < MINSIZE) || (depth > MAXDEPTH)){
            pqsort(x, low, j, depth+1);
        }
        else{
            SortThread<T> Th(s, x, low, j, depth+1);
            Th.start();
            checkflag = 1;
        }
    }
    if (j+1 > low) pqsort(x, j+1, high, depth);
    int y;
    if (checkflag) s.read(y);
}

The SortThread<T> class is based on a templated subclass of thread. The
local state of each thread contains a reference to the sync variable, a pointer
to the array of data, the index values of the range to be sorted, and the
depth of the tree when the object was created.

template <class T>
class SortThread: public HPCxx_Thread{
    HPCxx_Sync<int> &s;
    T *x;
    int low, high;
    int depth;
public:
    SortThread(HPCxx_Sync<int> &s_, T *x_,
               int low_, int high_, int depth_):
        s(s_), x(x_), low(low_), high(high_),
        depth(depth_), HPCxx_Thread(NULL){}
    void run(){
        int k = 0;
        pqsort(x, low, high, depth);
        s.write(k);
    }
};

This program was run on the SGI Power Challenge with 10 processors for
a series of random arrays of 5 million integers. As noted above, the speed-up
is bounded by the theoretical limit of about 7. We also experimented with the
MINSIZE and MAXDEPTH parameters. We found that a MINSIZE of 500
and a MAXDEPTH of 6 gave the best results. We ran several hundred
experiments and found that speed-up ranged from 5.6 to 6.6 with an average
of 6.2 for 10 processors. With a depth of 6 the number of threads was
bounded by 128.

What is not clear from our experiments is the role thread scheduling and
thread initialization play in the final performance results. This is a topic that
is under current investigation.



5. Implementing the HPC++ Parallel Loop Directives

Parallel loops in HPC++ can arise in two places: inside a standard function
or in a member function for a class. In both cases, these occurrences may be
part of an instantiation of a template. We first describe the transformational
problems associated with class member functions. The approach we use is
based on a style of loop transformations invented by Aart Bik for
parallelizing loops in Java programs [4]. The basic approach was illustrated
in the Embarrassingly Parallel example in the previous section.

Consider the following example. 

class C{
    double x[N];
    float &y;
public:
    void f(float *z, int n){
        double tmp = 0;
        #pragma HPC_INDEPENDENT
        for(int i = 0; i < n; i++){
            #pragma HPC_REDUCE tmp
            tmp += y*x[i]+*z++;
        }
        cout << tmp;
    }
};

To parallelize the loop we will replace it with another loop which spawns a
number (K) of threads. Each thread will execute a new class member function
(generated by the compiler) which executes a subset of the iterations of the
original loop. The threads that are spawned to execute the loop are of a
subclass of the HPCxx_Thread class that is generated by the compiler.

The primary task of the compiler is to resolve the scope of the variables
that are referenced within the body of the loop and to determine which of
these must be duplicated in each thread and which need only be referenced
from the thread. More specifically, we see the loop in our example contains

- loop bounds such as n and other variables, in this case z and tmp, that are
  scoped within the member function that contains the loop.
- variables that are referenced within the loop but are data members of the
  object or class.
- globally defined variables.

The last two cases are not a problem because the semantics of a parallel loop
require that any loop-carried data dependences be a reduction form and be
so labeled (as with the variable tmp in the example above). Consequently,
if these are not reduction variables, we can refer to them directly from the
generated member function.

However, variables that are members of the first category are private
to the scope of the original member function. Copies of or references to these
variables must be duplicated as private data members of the new thread class.
In addition, a description of the subset of iterations to be executed by each
thread object must also be included in the thread object. This iteration space
partitioning can be done in any of the classical ways (see [4]). In this case we
illustrate the example with simple loop blocking.
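
As a small illustration of loop blocking, the sketch below computes a
(base, block size) pair for each of K threads. It is a hedged example, not
compiler output: the Block struct and block_partition function are
hypothetical names. Unlike a plain n/K block size, which covers every
iteration only when K divides n, this variant spreads the remainder over
the first blocks.

```cpp
#include <vector>

struct Block { int base; int bs; };   // start index and block size

// Split iterations 0..n-1 into K contiguous blocks whose sizes differ by
// at most one, so no iteration is dropped when K does not divide n.
std::vector<Block> block_partition(int n, int K) {
    std::vector<Block> blocks;
    int q = n / K, r = n % K, base = 0;
    for (int th = 0; th < K; ++th) {
        int bs = q + (th < r ? 1 : 0);   // the first r blocks get one extra
        blocks.push_back({base, bs});
        base += bs;
    }
    return blocks;
}
```

For n = 10 and K = 4 this yields blocks of sizes 3, 3, 2, 2 starting at
0, 3, 6 and 8, whereas a block size of 10/4 = 2 would cover only the first
eight iterations.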

In our example, the thread class that is generated is given below.

class C_Thread: public HPCxx_Thread{
    C *r;         // the object whose loop is being executed
    float *z;
    int bs;       // the iteration subspace block size
    int base;     // the start of the iteration subspace
    double &tmp;
    HPCxx_Reduct1<double, doubleAdd> &add;
    int key;
public:
    C_Thread(C *r_, float *z_, double &tmp_, int bs_, int base_,
             HPCxx_Reduct1<double, doubleAdd> &add_):
        HPCxx_Thread(NULL), r(r_), z(z_), bs(bs_), base(base_),
        tmp(tmp_), add(add_) { key = add.getKey(); }
    void run(){
        r->f_blocked(z, tmp, bs, base, key);
    }
};

The user's class C must be modified as follows. The compiler must create
a new member function, called f_blocked() below, which executes a blocked
version of the original loop. This function is called by the thread class as
shown above. In addition, the original loop has to be modified to spawn the
K threads (where K is determined by the runtime environment) and start
each thread. (As in our previous examples, we have deleted the details about
removing the thread objects when the thread completes execution.)

The user class is also given local synchronization objects to signal the
termination of the loop. For loops that do not have any reduction operations,
we can synchronize the completion of the original iteration with an HPCxx_CSem
as was done in the Embar benchmark in the previous section. However, in
this case, we do have a reduction, so we use the reduction classes and
generate a barrier-based reduction that will synchronize all of the iteration
classes with the calling thread as shown below.

class C{
    double x[N];
    float &y;
    HPCxx_Group g;
    HPCxx_Reduct1<double, doubleAdd> add;
public:
    C(...): add(g){ .... }

    void f_blocked(float *z, double &tmp, int bs, int base, int key){
        z = z+base;
        double t = 0;
        for(int i = base; i < base+bs; i++){
            t += y*x[i]+*z++;
        }
        add(key, t, doubleAdd());
    }

    void f(float *z, int n){
        double tmp = 0;
        g.setThreadCount(K+1);
        int key = add.getKey();
        for(int th = 0; th < K; th++){
            C_Thread *t = new C_Thread(this, z, tmp, n/K,
                                       th*(n/K), add);
            t->start();
        }
        tmp = add(key, 0.0, doubleAdd());
        cout << tmp;
    }
};

We have omitted treatment of the scheduling algorithms that are possible
and refer the reader to [4].
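
The shape of the transformed loop, a blocked loop body with a private
accumulator per thread followed by a combining step, can be sketched with
standard C++ threads. This is an illustration under stated assumptions, not
the generated code: std::thread and join() stand in for C_Thread and the
barrier-based HPCxx_Reduct1 reduction, and blocked_reduce is a hypothetical
free function rather than a member of class C.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for C::f(): computes the sum over i of
// y*x[i] + z[i] using K worker threads, one contiguous block each.
double blocked_reduce(const std::vector<double> &x, float y,
                      const std::vector<float> &z, int K) {
    int n = static_cast<int>(x.size());
    std::vector<double> partial(K, 0.0);
    std::vector<std::thread> workers;
    for (int th = 0; th < K; ++th) {
        int base = static_cast<int>(static_cast<long long>(n) * th / K);
        int end = static_cast<int>(static_cast<long long>(n) * (th + 1) / K);
        workers.emplace_back([&, th, base, end] {
            double t = 0;                       // thread-private accumulator
            for (int i = base; i < end; ++i) t += y * x[i] + z[i];
            partial[th] = t;                    // one write per thread
        });
    }
    for (auto &w : workers) w.join();           // synchronize with the caller
    double tmp = 0;
    for (double p : partial) tmp += p;          // combine the partial sums
    return tmp;
}
```

Each thread accumulates into its own local variable and contributes a
single value at the end, which is the property the HPC_REDUCE label allows
the compiler to rely on.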



6. Multi-context Programming and Global Pointers 

The preceding sections of this chapter have discussed multi-threaded parallel 
programming in a single address space context. 

However, one of the powers of HPC++ is the ability of a program in
one context to communicate with another. There are two ways to do this.
One method is for one program to dynamically "attach" itself to a second
program, and the second method is called Single Program Multiple Data
(SPMD) style, where a number of copies of the same program are loaded
into different contexts at the same time. HPC++ is designed around the
SPMD model of multi-context programming.

The central problem associated with multi-context computation is the
communication and synchronization of events between two contexts. HPC++
is based on the CC++ global pointer concept and the library described here
implements this with a GlobalPtr<T> template as is done in the MPC++
Template Library [10].

A global pointer is an object that is a proxy for a remote object. In most
ways it can be treated exactly as a pointer to a local object. One major
difference is that global pointers can only point to objects allocated from a
special "global data area". To allocate a global pointer object of type T from
the global area one can write

    HPCxx_GlobalPtr<T> p = new ("hpcxx_global") T(args);

or, for a global pointer to an array of 100 objects,

    HPCxx_GlobalPtr<T> p = new ("hpcxx_global") T[100];

For objects of simple type a global pointer can be dereferenced like any other
pointer. For example, assignment and copy through a global pointer is given
by

    HPCxx_GlobalPtr<float> p = new ("hpcxx_global") float;
    *p = 3.14;
    float y = 2 - *p;

Integer arithmetic and the [] operator can be applied to global pointers in
the same way as ordinary pointers. Because global pointer operations are far
more expensive than regular pointer dereferencing, there are special operators
for reading and writing blocks of data.

    void HPCxx_GlobalPtr<T>::read(T *buffer, int size);
    void HPCxx_GlobalPtr<T>::write(T *buffer, int size);



Objects of a user-defined type may be copied through a global pointer only
if there are pack and unpack friend functions defined as follows. Suppose you
have a class of the form shown below. You must also supply a special function
that knows how to pack an array of such objects.

class C{
    int x;
    float y[100];
public:
    friend void hpcxx_pack(HPCxx_Buffer *b, C *buffer, int size);
    friend void hpcxx_unpack(HPCxx_Buffer *b, C *buffer, int &size);
};

void hpcxx_pack(HPCxx_Buffer *b, C *buffer, int size){
    hpcxx_pack(b, size, 1);
    for(int i = 0; i < size; i++){
        hpcxx_pack(b, buffer[i].x, 1);
        hpcxx_pack(b, buffer[i].y, 100);
    }
}

void hpcxx_unpack(HPCxx_Buffer *b, C *buffer, int &size){
    hpcxx_unpack(b, size, 1);
    for(int i = 0; i < size; i++){
        hpcxx_unpack(b, buffer[i].x, 1);
        hpcxx_unpack(b, buffer[i].y, 100);
    }
}

These pack and unpack functions can be considered a type of remote
constructor. For example, suppose a class object contains a pointer to a buffer
in the local heap. It is possible to write the unpack function so that it
allocates the appropriate storage on the remote context and initializes it with
the appropriate data.
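
The "remote constructor" idea can be illustrated with a self-contained
sketch. Everything here is hypothetical: the Buffer struct is a plain byte
vector standing in for HPCxx_Buffer (whose interface is not shown in this
chapter), and Samples, pack and unpack are illustrative names rather than
the hpcxx_pack and hpcxx_unpack overloads.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical byte buffer used only for this illustration.
struct Buffer {
    std::vector<unsigned char> bytes;
    std::size_t pos = 0;                       // read cursor
    template <class T> void put(const T *p, int count) {
        const unsigned char *src = reinterpret_cast<const unsigned char *>(p);
        bytes.insert(bytes.end(), src, src + sizeof(T) * count);
    }
    template <class T> void get(T *p, int count) {
        std::memcpy(p, bytes.data() + pos, sizeof(T) * count);
        pos += sizeof(T) * count;
    }
};

// An object that owns heap storage: sending the raw pointer to another
// context would be meaningless, so pack sends the length and the data.
struct Samples {
    int n = 0;
    float *data = nullptr;
};

void pack(Buffer &b, const Samples &s) {
    b.put(&s.n, 1);
    b.put(s.data, s.n);
}

// Acts like a remote constructor: allocates storage on the receiving side
// and fills it from the buffer.
void unpack(Buffer &b, Samples &s) {
    b.get(&s.n, 1);
    s.data = new float[s.n];
    b.get(s.data, s.n);
}
```

After unpack, the receiving object owns a fresh array with the same
contents; the address held by the sender never crosses the context
boundary.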

Unfortunately, it is not possible to access data members directly through
global pointers without substantial compiler support. The following is an
illegal operation

    class D{
    public:
        int x;
    };

    HPCxx_GlobalPtr<D> p = new ("hpcxx_global") D;
    p->x;   // illegal member reference

To solve this problem, we must create a data access function that returns
this value. Then we can make a remote member call.

6.1 Remote Function and Member Calls 

For a user-defined class C with a member function,

    class C{
    public:
        int foo(float, char);
    };

the standard way to invoke the member through a pointer is an expression
of the form:

    C *p;
    p->foo(3.14, 'x');

It is a bit more work to make the member function call through a global
pointer. First, for each type C we must register the class and all members that
will be called through global pointers. Registering the class is easy. There is a
macro that will accomplish this task and it should be called at the top level.
We next must register the member as shown below.

    hpcxx_register(C);

    int main(int argc, char *argv[]){
        HPCxx_Group *g;
        hpcxx_init(&argc, &argv, g);
        hpcxx_id_t C_foo_id = hpcxx_register(C::foo);

The overloaded register templated function builds a table of system
information about registered functions, classes, and member functions, and
returns its location as an ID.

Because this ID is an index into the table, it is essential that each context
register the members in exactly the same order.
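
Why the registration order matters can be seen in a small model of such a
table. The sketch is hypothetical and much simpler than the library: the
Registry struct, the Handler type and the two sample functions are
assumptions for this example, and the handlers all share one signature.

```cpp
#include <vector>

using Handler = int (*)(int);          // hypothetical uniform handler type

// Hypothetical per-context table: the ID is simply the next free index.
struct Registry {
    std::vector<Handler> table;
    int add(Handler h) {
        table.push_back(h);
        return static_cast<int>(table.size()) - 1;
    }
    int invoke(int id, int arg) const { return table[id](arg); }
};

inline int twice(int v) { return 2 * v; }
inline int negate_value(int v) { return -v; }
```

If context 0 registers twice and then negate_value, while context 1
registers them in the opposite order, the ID that context 0 obtained for
twice selects negate_value in context 1: invoking it with 21 gives 42
locally but -21 remotely.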

To invoke the member function, there is a special function template

    HPCxx_GlobalPtr<C> P;
    int z = hpcxx_invoke(P, C_foo_id, 3.13, 'x');

Invoke will call C::foo(3.13, 'x') in the context that contains the object that
P points to. The calling process will wait until the function returns.

The asynchronous invoke will allow the calling function to continue
executing until the result is needed.

    HPCxx_Sync<int> sz;
    hpcxx_ainvoke(&sz, P, C_foo_id, 3.13, 'x');
    .... // go do some work
    int z = sz; // wait here.

It should be noted that it is not a good idea to pass pointers as argument
values to invoke or ainvoke. However, it is completely legal to pass global
pointers and return global pointers as results of remote member invocations.
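
The overlap that the asynchronous invoke provides has a close analogue in
standard C++. In the sketch below std::async and std::future play the roles
of hpcxx_ainvoke and HPCxx_Sync<int>; this is an analogy only, since the
call runs in a local thread rather than in a remote context, and foo is a
hypothetical stand-in for the registered member.

```cpp
#include <future>

// Hypothetical stand-in for the remote member C::foo(float, char).
inline int foo(float a, char c) {
    return static_cast<int>(a) + (c == 'x' ? 1 : 0);
}

// Start the call, let the caller continue, and block only when the
// result is actually read.
inline int overlap_work() {
    std::future<int> sz = std::async(std::launch::async, foo, 3.13f, 'x');
    int local = 10;            // ... go do some other work ...
    return local + sz.get();   // wait here, like "int z = sz;"
}
```

The caller pays for the wait only at the point where it reads the result,
which is the behavior described for the sync variable above.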

6.1.1 Global Functions. Ordinary functions can be invoked remotely. By
using a ContextID, the context that should invoke the function may be
identified.

    HPCxx_ContextID HPCxx_Group::getContextID(int i);

For example, to call a function in context "3" from context "0", the
function must be registered in each context. (As with member functions, the
order of the function registration determines the function identifier, so the
functions must be registered in exactly the same order in each context.)

    double fun(char x, int y);

    int main(int argc, char *argv[]){
        HPCxx_Group *g;
        hpcxx_init(&argc, &argv, g);
        hpcxx_id_t fun_id = hpcxx_register(fun);

        // remote invocation of x = fun('z', 44);
        double x = hpcxx_invoke(g->getContextID(3), fun_id, 'z', 44);

        // asynchronous invocation
        HPCxx_Sync<double> sx;
        hpcxx_ainvoke(&sx, g->getContextID(3), fun_id, 'z', 44);
        x = sx;
    }



6.2 Using Corba IDL to Generate Proxies 

Two of the most annoying aspects of the HPC++Lib are the requirements
to register member functions and write the pack and unpack routines for
user-defined classes. In addition, the use of the invoke template syntax

    HPCxx_GlobalPtr<C> P;
    int z = hpcxx_invoke(P, member_id, 3.13, 'x');

instead of the more conventional syntax

    z = P->member(3.13, 'x');

is clearly awkward.

Unfortunately, the C++ language does not allow an extension to the
overloading of the -> operator that will provide this capability. However,
there is another solution to this problem.

The CORBA Interface Definition Language (IDL) provides a well structured
language for defining the public interface to remote objects and
serializable classes.

As a separate utility, HPC++Lib provides an IDL to C++ translator
that maps interface specifications to user-defined classes. For example,
consider the IDL interface definition of a remote blackboard object class and
the definition of a structure which represents a graphical object (Gliph) that
can be drawn on the blackboard.

    struct Gliph{
        short type;
        int x, y;
        int r, g, b;
    };

    interface BBoard{
        int draw(in Gliph mark);
    };

The IDL to C++ compiler generates a special case of the global pointer
template and the header for the blackboard class as shown below. The
prototype for BBoard contains a static registration function that registers all
the member functions. The user only need call this one registration function at
the start of the program. The specialization of the global pointer template
contains the requisite overloading of the -> operator and a new member
function for each of the functions in the public interface.

    class BBoard{
        static int draw_id;
    public:
        static void register(){
            draw_id = hpcxx_register(BBoard::draw);
        }
        int draw(Gliph mark);
    };

    template <>
    class HPCxx_GlobalPtr<BBoard>{
        // implementation specific GP attributes
    public:
        HPCxx_GlobalPtr<BBoard> *operator->(){ return this; }
        int draw(Gliph mark){
            return hpcxx_invoke(*this, BBoard::draw_id, mark);
        }
    };

The structure Gliph in the interface specification is compiled into a
structure which contains the serialization pack and unpack functions.

Using this tool the user compiles the interface specification into a new
header file. This file can be included into the C++ files which contain the
use of the class as well as the definition of the functions like BBoard::draw.
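
The operator-> technique used by the generated specialization can be tried
out locally. The sketch below is a simplified model, not translator output:
BBoardPtr is a hypothetical proxy that holds an ordinary pointer where a
real global pointer would hold a context and an address, and its draw
member forwards directly instead of calling hpcxx_invoke.

```cpp
#include <vector>

struct Gliph { short type; int x, y; int r, g, b; };

// Hypothetical local stand-in for the remote blackboard object.
struct BBoard {
    std::vector<Gliph> marks;
    int draw(Gliph mark) {
        marks.push_back(mark);
        return static_cast<int>(marks.size());
    }
};

// Proxy in the style of the generated specialization: operator-> returns
// the proxy itself, so p->draw(m) resolves to the forwarding member.
class BBoardPtr {
    BBoard *target;   // a real global pointer would not hold a raw address
public:
    explicit BBoardPtr(BBoard *t) : target(t) {}
    BBoardPtr *operator->() { return this; }
    int draw(Gliph mark) {
        // A real proxy would issue the remote invocation here.
        return target->draw(mark);
    }
};
```

Because operator-> yields the proxy rather than the target, the expression
p->draw(m) never touches BBoard directly; every public function of the
interface must therefore have a forwarding member in the proxy.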

To use the class with remote pointers the program must include only the
registration call

    BBoard::register();

in the main program.



7. The SPMD Execution Model

The Single Program Multiple Data (SPMD) model of execution is one of the
standard models used in parallel scientific programming. Our library supports
this model as follows. At program load time n copies of a single program are
loaded into n processes, each of which defines a running context.

The running processes are coordinated by the hpcxx_init initialization
routine, which is invoked as the first call in the main program of each context.

    int main(int argc, char *argv[]){
        HPCxx_Group *g;
        hpcxx_init(&argc, &argv, g);

As described in the multi-context programming section, the context IDs allow
one context to make remote function calls to any of the other contexts.

The SPMD execution continues with one thread of control per context
executing main. However, that thread can dynamically create new threads
within its context. There is no provision for thread objects to be moved from
one context to another.

7.1 Barrier Synchronization and Collective Operations 

In SPMD execution mode, the runtime system provides the same collective
operations as were provided before for multi-threaded computation. The only
semantic difference is that the collective operations apply across every context
and every thread of the group.

The only syntactic difference is that we allow a special form of the
overloaded () operator that does not require a thread "key". For example, to do
a barrier between contexts all we need is the HPCxx_Group object.

    int main(int argc, char *argv[]){
        HPCxx_Group *context_set;
        hpcxx_init(&argc, &argv, context_set);
        HPCxx_Barrier barrier(context_set);
        HPCxx_Reduct1<float, floatAdd> float_reduct(context_set);

        barrier();
        float z = 3.14;
        z = float_reduct(z, floatAdd());

Note that the thread key can be used if there are multiple threads in a
context that want to synchronize with the other contexts.



8. Conclusion 

This chapter has outlined a library and compilation strategy that is
being used to implement the HPC++ Level 1 design. An earlier version of
HPC++Lib was used to implement the prototype PSTL library and future
releases of PSTL will be completely in terms of HPC++Lib.

Our goal in presenting this library goes beyond illustrating the foundation
for HPC++ tools. We also feel that this library will be used by application
programmers directly and care has been taken to make it as usable as possible
for large scale scientific applications. In particular, we note that the MPC++
MTTL has been in use with the Real World Computing Partnership as an
application programming platform for nearly one year and CC++ has been
in use for two years. HPC++Lib is modeled on these successful systems and
adds only a Java style thread class and a library for collective operations. In
conjunction with the parallel STL, we feel that this will be an effective tool
for writing parallel programs.

The initial release of HPC++Lib will be available by the time of this
publication at http://www.extreme.indiana.edu and other sites within the
Department of Energy. There will be two runtime systems in the initial
release. One version will be based on the Argonne/ISI Nexus runtime system
and the other will be using the LANL version of Tulip [3]. Complete
documentation and sample programs will be available with the release.



Summary. In a concurrent object-oriented programming language one would like
to be able to inherit behavior and realize synchronization control without
compromising the flexibility of either the inheritance mechanism or the
synchronization mechanism. A problem called the inheritance anomaly arises
when synchronization constraints are implemented within the methods of a class
and an attempt is made to specialize methods through inheritance. The anomaly
occurs when a subclass violates the synchronization constraints assumed by the
superclass. A subclass should have the flexibility to add methods, add instance
variables, and redefine inherited methods. Ideally, all the methods of a
superclass should be reusable. However, if the synchronization constraints are
defined by the superclass in a manner prohibiting incremental modification
through inheritance, they cannot be reused, and must be reimplemented to
reflect the new constraints; hence, inheritance is rendered useless. We have
proposed a novel model of concurrency abstraction, where (a) the
specification of the synchronization code is kept separate from the method
bodies, and (b) the sequential and concurrent parts in the method bodies of a
superclass are inherited by its subclasses in an orthogonal manner.



1. Introduction 

A programming model is a collection of program abstractions which
provides a programmer with a simplified, and transparent, view of the
computer's hardware/software system. Parallel programming models are
specifically designed for multiprocessors, multicomputers, or vector computers,
and are characterized as: shared-variable, message-passing, data-parallel,
object-oriented (OO), functional, logic, and heterogeneous. Parallel
programming languages provide a platform to a programmer for effectively
expressing (or, specifying) his/her intent of parallel execution of the parts of
computations in an application. A parallel program is a collection of processes
which are the basic computational units. The granularity of a process may vary
in different programming models and applications. In this work we address
issues relating to the use of the OO programming model in programming parallel
machines [20].

OO programming is a style of programming which promotes program
abstraction, and thus leads to modular and portable programs and reusable
software. This paradigm has radically influenced the design and development
of almost all kinds of computer applications including user-interfaces, data
structure libraries, computer-aided design, scientific computing, databases,
network management, client-server based applications, compilers and
operating systems.

Unlike the procedural approach, where problem-solving is based around
actions, OO programming provides a natural facility for modeling and
decomposing a problem into objects. An object is a program entity which
encapsulates data (a.k.a. instance variables) and operations (a.k.a. methods)
into a single computational unit. The values of the instance variables
determine the state of an object. Objects are dynamically created and their
types may change during the course of program execution. The computing
is done by sending messages (a.k.a. method invocation) among objects. In
Figure 1.1 the methods (marked as labels on the edges) transform the initial
states (shown as circles), r0 and s0, of two objects, O1 and O2, into their final
states, rM and sN, respectively. Notice that O2 sends a message, p2, to O1
while executing its method, q2.




Fig. 1.1. The object-oriented computing model. 



On one hand, the OO language features are a boon to writing portable
programs efficiently; but on the other hand they also contribute to degraded
performance at run-time. It turns out that concurrency is a natural
consequence of the concept of objects: the concurrent use of coroutines in
conventional programming is akin to the concurrent manipulation of objects
in OO programming. Notice from Figure 1.1 that the two objects, O1 and
O2, change their states independently and only occasionally exchange values.
Clearly, the two objects can compute concurrently. Since the OO
programming model is inherently parallel, it should be feasible to exploit the
potential parallelism from an OO program in order to improve its performance.
Furthermore, if two states for an object can be computed independently, a finer
granularity of parallelism can be exploited. For instance, in the Actor [1] based
concurrent OO languages, maximal fine-grained object-level concurrency can
be specified. Load-balancing becomes fairly easy as the objects can be freely
placed and migrated within a system. Such languages have been proposed as
scalable programming approaches for massively parallel machines [ ].

Several research projects have successfully exploited the data parallel
(SIMD) programming model in C++, notably, Mentat [15], C** [23], pC++
[13], and Charm++ [1 ]; probably these researchers were overwhelmed by the
benefits of this model as noted in Fortran 90 and High Performance Fortran
(HPF) [16] applications. The data parallel models of computations exploit the
property of homogeneity in computational data structures, as is commonly
found in scientific applications. For heterogeneous computations, however, it
is believed that the task-parallel* (SPMD or MIMD) model is more
effective than the data-parallel model. Such a model exists in Fortran M [12]
and CC++ [8].

Most of the OO applications by nature are based on heterogeneous
computations, and consequently, are more amenable to task-parallelism. Besides,
the existence of numerous, inherently concurrent objects can assist in
masking the effects of latency (as there is always work to be scheduled).
Furthermore, the SPMD model is cleanly represented in a task parallel OO
language where the data and code that manipulates that data are clearly
identified.

Consider a typical OO program segment shown below. Our goal is to seek
a program transformation from the left to the right. In other words, can we
concurrently execute S1 and S2, and if so, how?

                                        par {
    S1: object_i.method_p();              S1: object_i.method_p();
    S2: object_j.method_q();      =>      S2: object_j.method_q();
                                        }



There are two possible approaches: either (a) specify the above parallelism
by writing a parallel program in a Concurrent OO Programming Language
(COOPL), or (b) automatically detect that the above transformation is valid
and then restructure the program. In this work we limit our discussion
to the specification of parallelism.
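
To make the intended meaning of the transformation concrete, the sketch
below runs two method calls on different objects concurrently using
standard C++ threads. It is an illustration only: C++ has no par construct,
and Counter and par_two are hypothetical names introduced for this example.

```cpp
#include <thread>

struct Counter {
    int value = 0;
    void add(int n) { for (int i = 0; i < n; ++i) ++value; }
};

// Hypothetical rendering of "par { S1; S2 }": the two calls touch
// different objects, so they may run concurrently; both have finished
// when par_two returns.
inline void par_two(Counter &oi, Counter &oj) {
    std::thread s1([&oi] { oi.add(1000); });   // S1: object_i.method_p()
    std::thread s2([&oj] { oj.add(2000); });   // S2: object_j.method_q()
    s1.join();
    s2.join();
}
```

The example is safe only because the two objects share no state; if both
statements operated on the same object, the calls would need
synchronization.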

In designing a parallel language we aim for: (i) efficiency in its
implementation, (ii) portability across different machines, (iii) compatibility
with existing sequential languages, (iv) expressiveness of parallelism, and
(v) ease of programming. A COOPL offers the benefits of object-orientation,
primarily, inherently concurrent objects, ease of programming and code-reuse
with the task of specification of parallelism (or more generally, concurrency).
Several COOPLs have been proposed in the literature (refer to [4, 18, 33] for
a survey).



* a.k.a. control or functional parallelism.

It is unfortunate that when parallel programs are written in most of these
COOPLs, the inheritance anomaly [25] is unavoidable and the reusability of
the sequential components is improbable [21, 22]. In using a COOPL, one
would want to inherit the behavior and realize synchronization control
without compromising the flexibility in either exploiting the inheritance
characteristics or using different synchronization schemes. A problem called
the inheritance anomaly [25] arises when synchronization constraints are
implemented within the methods of a class and an attempt is made to specialize
methods through inheritance. The anomaly occurs when a subclass violates
the synchronization constraints assumed by the superclass. A subclass should
have the flexibility to add methods, add instance variables, and redefine
inherited methods. Ideally, all the methods of a superclass should be reusable.
However, if the synchronization constraints are defined by the superclass in a
manner prohibiting incremental modification through inheritance, they
cannot be reused, and must be reimplemented to reflect the new constraints;
hence, inheritance is rendered useless.

We claim that the following are the two primary reasons for the
inheritance anomaly [25] and the improbable reuse of the sequential components
in a COOPL [22]:

- The synchronization constraints are implemented within the methods of a
  class and an attempt is made to specialize methods through inheritance.
  The anomaly occurs when a subclass violates the synchronization
  constraints assumed by the superclass.
- The inheritance of the sequential part and the concurrent part of a method
  code are not orthogonal.

In this work we have proposed a novel model of concurrency abstraction,
where (a) the specification of the synchronization code is kept separate from
the method bodies, and (b) the sequential and concurrent parts in the method
bodies of a superclass are inherited by its subclasses in an orthogonal manner.
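
The idea of keeping synchronization constraints outside the method bodies
can be illustrated in plain C++. The sketch is not the syntax of the
language proposed in this chapter; BoundedBuffer, Guards and
BoundedBuffer2 are hypothetical names, and the guards are simple
predicates that a caller checks, whereas a real runtime would block a
caller whose guard is false.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Sequential part: no synchronization code in the method bodies.
class BoundedBuffer {
protected:
    std::deque<int> items;
    std::size_t capacity;
public:
    explicit BoundedBuffer(std::size_t cap) : capacity(cap) {}
    virtual ~BoundedBuffer() = default;
    void put(int v) { items.push_back(v); }
    int get() { int v = items.front(); items.pop_front(); return v; }
    std::size_t size() const { return items.size(); }
    bool full() const { return items.size() == capacity; }
};

// Concurrent part: one guard per method, stored outside the method bodies.
class Guards {
    std::map<std::string, std::function<bool()>> table;
public:
    void set(const std::string &method, std::function<bool()> g) {
        table[method] = std::move(g);
    }
    bool enabled(const std::string &method) const {
        return table.at(method)();
    }
};

// The subclass adds get2() and needs only one new guard; the inherited
// put() and get() bodies are reused unchanged.
class BoundedBuffer2 : public BoundedBuffer {
public:
    using BoundedBuffer::BoundedBuffer;
    int get2() { int a = get(); return a + get(); }
};
```

With guards such as "put is enabled when the buffer is not full" and
"get2 is enabled when at least two items are present", adding get2 in the
subclass extends the guard table without touching the inherited methods,
which is the kind of reuse that embedding the constraints in the method
bodies prevents.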

The rest of the chapter is organized as follows. Section 2 discusses the is- 
sues in designing and implementing a COOPL. In sections 3 and 4 we present 
a detailed description of the inheritance anomaly and the reuse of sequen- 
tial classes, respectively. In section 5 we propose a framework for specifying 
parallelism in COOPLs. In section 6 we review various COOPL approaches 
proposed by researchers in solving diEterent kinds of inheritance anomalies. In 
section 7 we describe the concurrency abstraction model we have adopted in 
designing our proposed COOPL, CORE. Subsequently, we give an overview 
of the CORE model and describe its features. Later, we illustrate with examples
how a CORE programmer can effectively avoid the inheritance anomaly
and also reuse the sequential classes. Finally, we discuss an implementation 
approach for CORE, summarize our conclusions and indicate directions for 
future research. 

2. Approaches to Parallelism Specification

Conventionally, specification of task parallelism refers to explicit creation of
multiple threads of control (or tasks) which synchronize and communicate
under a programmer's control. Therefore, a COOPL designer should provide
explicit language constructs for specification, creation, suspension, reactivation,
migration, termination, and synchronization of concurrent processes. A
full compiler has to be implemented when a new COOPL is designed. Alternatively,
one could extend a sequential OO language in widespread use, such
as C++ [11], to support concurrency. The latter approach is more beneficial
in that: (a) the learning curve is smaller, (b) incompatibility problems sel- 
dom arise, and most importantly, (c) a preprocessor and a low-level library 
of the target computer system can more easily upgrade the old compiler to 
implement high-level parallel constructs. 

2.1 Issues in Designing a COOPL 

A COOPL designer should carefully consider the pros and cons in selecting 
the following necessary language features: 

1. Active vs. Passive Objects: An active object (a.k.a. an actor in [1])
possesses its own thread(s) of control. It encapsulates data structures,
operations, and the necessary communication and synchronization constraints.
An active object can be easily unified with the notion of a lightweight
process [3]. In contrast, a passive object does not have its own
thread of control. It must rely on either active objects containing it, or 
on some other process management scheme. 

2. Granularity of Parallelism: Two choices are evident depending upon 
the level of parallelism granularity sought: 

a) Intra-Object: An active object may be characterized as: (i) a se- 
quential object, if exactly one thread of control can exist in it; (ii) 
a quasi- concurrent object, such as a monitor, when multiple threads 
can exist in it, but only one can be active at a time; and (iii) a 
concurrent object, if multiple threads can be simultaneously active 
inside it [8, 33]. 

b) Inter-Object: If two objects receive the same message, can
they execute concurrently? For example, in CC++ [8], one can specify
inter-object parallelism by enclosing the method invocations inside
a parallel block, such as a cobegin-coend construct [3], where
an implicit synchronization is assumed at the end of the parallel
block.

3. Shared Memory vs. Distributed Memory Model (SMM vs. MM): 
Based on the target architecture, appropriate constructs for communica- 
tion, synchronization, and partitioning of data and processes may be 
necessary. It should be noted that even sequential OOP is considered 
equivalent to programming with messages: a method invocation on an
object, say x, is synonymous with a message send (as in MM) whose receiver
is x. However, floating pointers to data members (as in C++ [11])
may present a serious implementation hazard in COOPLs for MMs.

4. Object Distribution: In a COOPL for a MM, support for location-independent
object interaction, migration, and transparent access to other
remote objects may be necessary. This feature may require intense compiler
support and can increase run-time overhead [ ].

5. Object Interaction: On a SMM, communication may be achieved via 
synchronized access to shared variables, locks, monitors, etc., whereas on 
a MM, both synchronous and asynchronous message passing schemes 
may be allowed. Besides, remote procedure call (RPC) and its two vari- 
ants, blocking RPC and asynchronous RPC (ARPC) [7,34], may be sup- 
ported, too. 

6. Selective Method Acceptance: In a client/server based application 
[6], it may be necessary to provide support for server objects, which receive
messages non-deterministically based on their internal states, parameters,
etc.

7. Inheritance: The specification of synchronization code is considered
the most difficult part of writing parallel programs. Consequently, it is
highly desirable to avoid rewriting such code and instead reuse code via
inheritance. Unfortunately, in many COOPLs, inheritance is either disallowed
or limited [2, 18, 32, 33], in part due to the occurrence of the inheritance
anomaly [24, 25]. This anomaly is a consequence of reconciling concurrency
and inheritance in a COOPL, and its implications for a program
include: (i) extensive breakage of encapsulation, and (ii) redefinitions of
the inherited methods.
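
The notion of an active object from item 1 can be illustrated with a minimal sketch in standard C++ (not in any of the COOPLs discussed here). The class name ActiveCounter, its mailbox, and the methods Add and Stop are hypothetical names chosen for this illustration. The object owns one thread of control and serves its messages one at a time, so it corresponds to a sequential object in the sense of item 2(a).

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical active object: it owns a thread of control and serves the
// messages in its mailbox one at a time.
class ActiveCounter {
    int value_ = 0;                              // encapsulated state
    std::queue<std::function<void()>> mailbox_;  // pending messages
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;                         // declared last so it starts after the other members exist

    void serve() {
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !mailbox_.empty(); });
                if (mailbox_.empty()) return;    // stop requested and nothing left to serve
                msg = std::move(mailbox_.front());
                mailbox_.pop();
            }
            msg();                               // run the method body outside the lock
        }
    }
    void post(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lk(m_); mailbox_.push(std::move(msg)); }
        cv_.notify_one();
    }
public:
    ActiveCounter() : worker_([this] { serve(); }) {}
    // Asynchronous message send: the caller does not wait for the addition.
    void Add(int delta) { post([this, delta] { value_ += delta; }); }
    // Serves the remaining messages, stops the thread, and returns the value.
    int Stop() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
        return value_;
    }
    ~ActiveCounter() { Stop(); }
};
```

A passive object, in contrast, would expose Add as an ordinary method executed on the caller's thread, leaving process management to its clients.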



2.2 Issues in Designing Libraries 

Parallelism can be made a second class citizen in COOPLs by providing OO
libraries of reusable abstractions (classes). These classes hide the lower-level
details pertaining to specification of parallelism, such as architecture (SMM
or MM), data partitions, communications, and synchronization. Two kinds 
of libraries can be developed as described below: 

Implicit Libraries: These libraries use OO language features to encapsulate
concurrency at the object level. Comprehensive compiler support
is essential for: (i) creating active objects without explicit user commands
and in the presence of arbitrary levels of inheritance, (ii) preventing ac- 
ceptance of a message by an object until it has been constructed, (iii) 
preventing destruction of an object until the thread of control has been 
terminated, (iv) object interaction, distribution and migration, and (v) 
preventing deadlocks. 

Explicit Libraries: These class libraries provide a set of abstract data
types to support parallelism and synchronization. The objects of these 
classes are used in writing concurrent programs. In these libraries, the syn- 
chronization control and mutual exclusion code is more obvious at the user 
interface level. Most of these libraries are generally meant for programming 
on SMM [5]. 

The libraries are a good alternative for parallel programming as they are
more easily portable. Some notable libraries reported in the literature are:
ABC++, Parmacs [5], μC++, AT&T's Task Library, PRESTO, PANDA,
AWESIME, ES-Kit, etc. [4,33]. Although the above approaches to parallel
programming are considered simple and inexpensive, they require sophisticated
compiler support. Moreover, they fail to avoid the inheritance anomaly,
as will become clear from the following section.



3. What Is the Inheritance Anomaly? 

In this section we describe the inheritance anomaly. An excellent character- 
ization of this anomaly can be found in [25], and it has also been discussed 
in [18,22,24,32,33]. 

Definition 3.1. A synchronization constraint is that piece of code in a
concurrent program which imposes control over concurrent invocations on a
given object. Such constraints manage concurrent operations and preserve the
desired semantic properties of the object being acted upon. The invocations
that are attempted at inappropriate times are subject to delay. Postponed
invocations are those that invalidate the desired semantic property of that
object¹.

Consider one of the classical concurrency problems, namely, the bounded
buffer [3]. This problem can be modeled by defining a class, B_Buffer, as
shown in Figure 3.1. In this class², there are three methods: B_Buffer, Put,
and Get. The constructor creates a buffer, buf, on the heap of a user-specified
size, max. Note that buf is used as a circular array. Both the indices, in and
out, are initialized to the first element of buf, and n represents the number of
items stored in buf so far. Put stores an input character, c, at the current
index, in, and increments the values of in and n. On the other hand, Get
retrieves a character, c, at the current index, out, from buf, and decrements
the value of n but increments the value of out.

A synchronization constraint is necessary if Put and Get are concurrently
invoked on an object of B_Buffer. This synchronization constraint must
satisfy the following properties: (i) the execution of Put cannot be deferred as
long as there is an empty slot in buf, and (ii) the execution of Get must
be postponed until there is an item in buf. In other words, Get may proceed
only when the number of completed invocations of Put exceeds the number of
completed invocations of Get by at least one, and Put may proceed only when
that difference is less than max. Such a constraint is to be provided
by the programmer as part of the synchronization code in the concurrent
program. For example, one could model the above synchronization constraint
by using the P and V operations on two semaphores, full and empty.

¹ An excellent characterization of different synchronization schemes can be found
in [3].

² Note that we have used the C++ syntax for defining classes in this section.

    class B_Buffer {
        int in, out, n, max;
        char *buf;
    public:
        B_Buffer(int size) {
            max = size;
            buf = new char[max];
            in = out = n = 0;
        }
        void Put(char c) {
            P(empty);               // wait for an empty slot
            buf[in] = c;
            in = (in+1) % max;
            n++;
            V(full);                // signal that an item is available
        }
        char Get(void) {
            char c;
            P(full);                // wait for an item
            c = buf[out];
            out = (out+1) % max;
            n--;
            V(empty);               // signal that a slot is free
            return c;
        }
    };

Fig. 3.1. A class definition for B_Buffer.
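
To make the semaphore-based constraint concrete, the following is a minimal runnable sketch in standard C++, under several assumptions that are not part of Figure 3.1: the Semaphore class is a hypothetical helper built from a mutex and a condition variable, the extra mutex protecting the indices is added so that several producers or consumers could be used safely, and run_producer_consumer is a hypothetical driver.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical counting semaphore; P and V follow the names used in the text.
class Semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void P() {                                   // wait until a unit is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
    void V() {                                   // release one unit
        { std::lock_guard<std::mutex> lk(m_); ++count_; }
        cv_.notify_one();
    }
};

class B_Buffer {
    int in_, out_, n_, max_;
    std::vector<char> buf_;
    Semaphore empty_;                            // counts free slots, starts at max
    Semaphore full_;                             // counts stored items, starts at 0
    std::mutex mutex_;                           // assumption: guards the indices
public:
    explicit B_Buffer(int size)
        : in_(0), out_(0), n_(0), max_(size), buf_(size), empty_(size), full_(0) {}
    void Put(char c) {
        empty_.P();
        {
            std::lock_guard<std::mutex> lk(mutex_);
            buf_[in_] = c;
            in_ = (in_ + 1) % max_;
            ++n_;
        }
        full_.V();
    }
    char Get() {
        full_.P();
        char c;
        {
            std::lock_guard<std::mutex> lk(mutex_);
            c = buf_[out_];
            out_ = (out_ + 1) % max_;
            --n_;
        }
        empty_.V();
        return c;
    }
};

// Hypothetical driver: one producer and one consumer share a 2-slot buffer.
std::string run_producer_consumer(const std::string& text) {
    B_Buffer buffer(2);
    std::string received;
    std::thread producer([&] { for (char c : text) buffer.Put(c); });
    std::thread consumer([&] {
        for (std::size_t i = 0; i < text.size(); ++i) received += buffer.Get();
    });
    producer.join();
    consumer.join();
    return received;
}
```

Note that the synchronization code sits inside the bodies of Put and Get, which is exactly the arrangement that the anomalies discussed in the following subsections arise from.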

Consider defining a subclass of B_Buffer, where we need different synchronization
constraints. The methods Put and Get then may need non-trivial
redefinitions. A situation where such redefinitions become necessary with
inheritance of concurrent code in a COOPL is called the inheritance anomaly.
In the following subsections, we review different kinds of inheritance anomalies
as described in [25,33].

3.1 State Partitioning Anomaly (SPA) 

Consider Figure 3.2, where we have defined a subclass, B_Buffer2, of the
base class, B_Buffer, as introduced in Figure 3.1. B_Buffer2 is a specialized
version of B_Buffer, which inherits Put and Get from B_Buffer and defines a
new method, GetMorethanOne. GetMorethanOne invokes Get as many times
as the input value of howmany.



    class B_Buffer2: public B_Buffer {
    public:
        char GetMorethanOne(int howmany) {
            char last_char;
            for (int i=0; i < howmany; i++)
                last_char = Get();
            return last_char;
        }
    };

Fig. 3.2. A class definition for B_Buffer2.



Put, Get, and GetMorethanOne can be concurrently invoked on an object
of B_Buffer2. Besides the previous synchronization constraint for Put and
Get in B_Buffer, now we must further ensure that whenever GetMorethanOne
executes, there are at least two items in buf. Based on the current state
of the object, or equivalently, the number of items in buf, either Get or
GetMorethanOne must be accepted. In other words, we must further partition
the set of acceptable states for Get. In order to achieve such a finer partition,
the inherited Get must be redefined in B_Buffer2, resulting in SPA.

SPA occurs when the synchronization constraints are written as part of 
the method code and they are based on the partitioning of states of an object. 
This anomaly commonly occurs in accept-set based schemes. In one of the 
variants of this scheme, known as behavior abstraction [18], a programmer 
uses the become primitive [1] to specify the next set of methods that can 
be accepted by that object. An addition of a new method to a subclass is 
handled by redefining such a set to contain the name of the new method.

The use of guarded methods (as shown in Figure 3.3) can prevent SPA, 
where the execution of methods is contingent upon first evaluating their
guards. 



    void Put(char c) when (in < out + max) { ... }

    char Get(void) when (in >= out + 1) { ... }

    char GetMorethanOne(int howmany) when
        (in >= out + howmany) { ... }



Fig. 3.3. Redefined methods from class B_Buffer2.
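
Standard C++ has no when clause, but a guard can be approximated by waiting on a condition variable until a predicate holds. The sketch below is written under that assumption and is not the notation of any particular COOPL; the class names GuardedBuffer and GuardedBuffer2 are hypothetical, and in and out are kept as monotonically increasing counters so that the predicates match the guards of Figure 3.3 literally.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <vector>

class GuardedBuffer {
protected:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<char> buf_;
    long in_ = 0, out_ = 0;          // total numbers of items stored and retrieved
    const long max_;
public:
    explicit GuardedBuffer(long size) : buf_(size), max_(size) {}

    void Put(char c) {               // guard: in < out + max
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return in_ < out_ + max_; });
        buf_[in_ % max_] = c;
        ++in_;
        cv_.notify_all();            // guards of waiting callers may now hold
    }
    char Get() {                     // guard: in >= out + 1
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return in_ >= out_ + 1; });
        char c = buf_[out_ % max_];
        ++out_;
        cv_.notify_all();
        return c;
    }
};

class GuardedBuffer2 : public GuardedBuffer {
public:
    using GuardedBuffer::GuardedBuffer;
    // guard: in >= out + howmany. The inherited Put and Get are reused as is.
    char GetMorethanOne(long howmany) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return in_ >= out_ + howmany; });
        char last = '\0';
        for (long i = 0; i < howmany; ++i) {
            last = buf_[out_ % max_];
            ++out_;
        }
        cv_.notify_all();
        return last;
    }
};
```

In this sketch GetMorethanOne reads the buffer directly instead of calling Get, because calling Get while holding the non-recursive mutex would deadlock; this is a limitation of the emulation, not a property of guarded methods.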

3.2 History Sensitiveness of Acceptable States Anomaly (HSASA)

Consider Figure 3.4, where we have defined yet another subclass, B_Buffer3,
of B_Buffer (see Figure 3.1). B_Buffer3 is a specialized version of B_Buffer,
which inherits Put and Get and introduces a new method, GetAfterPut. Note
that the guard for GetAfterPut is also defined.



    class B_Buffer3: public B_Buffer {
    public:
        char GetAfterPut()
            when (!after_put && in >= out + 1) {
            ...
        }
    };

Fig. 3.4. A class definition for B_Buffer3.



Put, Get, and GetAfterPut can be concurrently invoked on an object of
B_Buffer3. Apart from the previous synchronization constraint for Put and
Get as in B_Buffer, we must now ensure that GetAfterPut executes only after
executing Put. The guard for GetAfterPut requires a boolean, after_put,
to be true, which is initially false. The synchronization requirement is that
after_put must be set to true and false in the inherited methods, Put and
Get, respectively. In order to meet such a requirement, the inherited methods,
Put and Get, must be redefined, thus resulting in HSASA.

HSASA occurs when it is required that the newly defined methods in a
subclass must only be invoked after certain inherited methods have been executed,
i.e., the invocations of certain methods are history sensitive. Guarded
methods are inadequate because the newly defined (and history sensitive)
methods wait on those conditions which can only be set in the inherited
methods, and consequently, redefinitions become essential.
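
The forced redefinitions can be seen in a small single-threaded sketch in standard C++. It follows the prose description above, in which the guard requires after_put to be true; guards are exposed as predicates (CanGet, CanGetAfterPut) instead of blocking waits so that the example stays sequential. The class names Buffer and Buffer3 and the predicate names are hypothetical.

```cpp
#include <cassert>
#include <deque>

// Sequential core of the buffer, with its guard exposed as a predicate.
class Buffer {
protected:
    std::deque<char> items_;
public:
    virtual ~Buffer() = default;
    virtual void Put(char c) { items_.push_back(c); }
    virtual char Get() {
        char c = items_.front();
        items_.pop_front();
        return c;
    }
    bool CanGet() const { return !items_.empty(); }
};

class Buffer3 : public Buffer {
    bool after_put_ = false;         // history that the base class knows nothing about
public:
    // Both inherited methods are redefined solely to record the history.
    void Put(char c) override { Buffer::Put(c); after_put_ = true; }
    char Get() override {
        char c = Buffer::Get();
        after_put_ = false;
        return c;
    }

    // Guard of the new method: enabled only when the last operation was a Put.
    bool CanGetAfterPut() const { return after_put_ && CanGet(); }
    char GetAfterPut() {
        char c = Buffer::Get();
        after_put_ = false;
        return c;
    }
};
```

The bodies of the overriding Put and Get add nothing to the sequential behavior; they exist only because the condition tested by the new guard can be set nowhere else.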

3.3 State Modification Anomaly (SMA)

Consider Figure 3.5, where we have defined two classes, Lock and B_Buffer4.
Lock is a mix-in class [6], which when mixed with another class, gives its
object an added capability of locking itself. In B_Buffer4 we would like to
add locking capability to its objects. Thus, the inherited Put and Get in
B_Buffer4 execute in either a locked or unlocked state. Clearly, the guards
in the inherited methods, Put and Get, must be redefined to account for
the newly added features. Besides, the invocation of methods of Lock on an
object of B_Buffer4 affects the execution of Put and Get for this object.

SMA occurs when the execution of a base class method modifies the condition(s)
for the methods in the derived class. This anomaly is usually found
in mix-in [6] class based applications. 

    class Lock {
        int locked;
    public:
        void lock()
            when (!locked)
            { locked = 1; }
        void unlock()
            when (locked)
            { locked = 0; }
    };

    class B_Buffer4:
        public B_Buffer, Lock {
        // Unlocked Put and Get.
    };

Fig. 3.5. Class definitions for Lock and B_Buffer4.



3.4 Anomaly A 

Some COOPL designers have advocated the use of a single centralized 
class for controlling the invocation of messages received by an object [7]. 
Anomaly A occurs when a new method is added to a base class such that all 
its subclasses are forced to redefine their centralized classes. This happens
because the centralized class associated with a subclass is oblivious of the 
changes in the base class, and therefore, cannot invoke the newly inherited 
method. 

Consider two centralized classes, B_Buffer_Server and B_Buffer2_Server,
as shown in Figure 3.6, for classes B_Buffer and B_Buffer2, respectively.
If a new method, NewMethod, is added to B_Buffer, it becomes immediately
visible in B_Buffer2; however, both B_Buffer_Server and B_Buffer2_Server
are oblivious of such a change and must be redefined for their correct use,
resulting in Anomaly A.



    class B_Buffer_Server {
        void controller() {
            switch (...) {
                case ...: Put(c); break;
                case ...: Get(); break;
            }
        }
    };

    class B_Buffer2_Server {
        void controller2() {
            switch (...) {
                case ...: Put(c); break;
                case ...: Get(); break;
                case ...: GetMorethanOne(n);
                          break;
            }
        }
    };

Fig. 3.6. Centralized server class definitions for B_Buffer and B_Buffer2.
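
A minimal single-threaded sketch of this effect in standard C++ follows. All names are hypothetical: Clear stands in for NewMethod, Msg enumerates the messages, and the controller reports through its return value whether it knew the message. The centralized controller written before Clear existed cannot dispatch it, so a redefined controller is required.

```cpp
#include <cassert>
#include <string>

class Buffer {
protected:
    std::string data_;
public:
    void Put(char c) { data_.push_back(c); }
    char Get() {
        char c = data_.front();
        data_.erase(0, 1);
        return c;
    }
    void Clear() { data_.clear(); }              // hypothetical stand-in for NewMethod
    std::size_t Size() const { return data_.size(); }
};

enum class Msg { Put, Get, Clear };

// Centralized controller written before Clear was added to Buffer.
class BufferServer {
protected:
    Buffer& target_;
public:
    explicit BufferServer(Buffer& b) : target_(b) {}
    virtual ~BufferServer() = default;
    // Returns false when the message is unknown to this controller.
    virtual bool controller(Msg m, char c = '\0') {
        switch (m) {
            case Msg::Put: target_.Put(c); return true;
            case Msg::Get: target_.Get();  return true;
            default:       return false;
        }
    }
};

// The redefinition that the new method forces on the controller.
class BufferServerV2 : public BufferServer {
public:
    using BufferServer::BufferServer;
    bool controller(Msg m, char c = '\0') override {
        if (m == Msg::Clear) { target_.Clear(); return true; }
        return BufferServer::controller(m, c);
    }
};
```

Every controller associated with a subclass of Buffer would need the same kind of change, which is the redefinition cost that the text calls Anomaly A.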

3.5 Anomaly B

Instead of using centralized synchronization, each method could maintain 
data consistency by using critical sections. A possible risk with this scheme, 
however, is that a subclass method could operate on the synchronization 
primitives used in the base class, resulting in Anomaly B. 



4. What Is the Reusability of Sequential Classes?

The success of an OO system in a development environment is largely dependent
upon the reusability of its components, namely, classes. A COOPL
designer must provide means for class reuse without editing previously written
classes. Many C++ based COOPLs do not support sequential class
reuse [21,22].

Consider Figure 4.1, where a base class, Base, is defined with three methods,
foo, bar, and foobar. We also define a subclass, Derived, of Base, which
inherits bar and foobar, but overrides foo with a new definition.



    class Base {
        int a, b, c;
    public:
        void foo() {
            a = 2;
            b = a*a;
        }
        void bar() {
            c = 3;
            c = a*c;
        }
        void foobar() {
            foo();
            bar();
        }
    };

    class Derived: public Base {
        int d;
    public:
        void foo() {
            bar();
            d = c * c;
        }
    };

Fig. 4.1. The sequential versions of the classes, Base and Derived.



Let us assume that the parallelism for methods of Base is specified as
shown in Figure 4.2. The parallelization of Base forces the redefinition of
Derived, because otherwise, one or both of the following events may occur:

Assume that a message foo is received by an object of Derived. A deadlock
occurs once the inherited bar is called from within foo, because bar contains a
receive synchronization primitive but there is no complementary send command.

Assume further that bar is not called from Derived::foo. Now, the inherited
definition of foobar becomes incorrect: foo and bar can no longer
be enclosed inside a parallel block as these two methods would violate the
Bernstein conditions [3,17].



    class Base {
        int a, b, c;
    public:
        void foo() {
            a = 2;
            send(a);
            b = a*a;
        }
        void bar() {
            c = 3;
            receive(a);
            c = a*c;
        }
        void foobar() {
            cobegin
                foo();
                bar();
            coend;
        }
    };

    class Derived: public Base {
        int d;
    public:
        void foo() {
            bar();
            d = c * c;
        }
    };

Fig. 4.2. The parallelized versions of the classes, Base and Derived.
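
The violation of the Bernstein conditions mentioned above can be checked mechanically. The sketch below, in standard C++, represents each method by the sets of variables it reads and writes; the Task structure and the functions disjoint and bernstein_ok are hypothetical names, and the read and write sets were derived by hand from the method bodies in Figure 4.1. Two tasks may run in parallel only if neither writes a variable that the other reads or writes.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

struct Task {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two sets have no variable in common.
static bool disjoint(const std::set<std::string>& x,
                     const std::set<std::string>& y) {
    return std::none_of(x.begin(), x.end(),
                        [&](const std::string& v) { return y.count(v) > 0; });
}

// Bernstein conditions for two tasks t1 and t2.
bool bernstein_ok(const Task& t1, const Task& t2) {
    return disjoint(t1.writes, t2.reads) &&
           disjoint(t1.reads, t2.writes) &&
           disjoint(t1.writes, t2.writes);
}

// Read and write sets of the sequential methods of Figure 4.1.
Task base_foo()    { return Task{{"a"}, {"a", "b"}}; }   // a = 2; b = a*a;
Task base_bar()    { return Task{{"a", "c"}, {"c"}}; }   // c = 3; c = a*c;
// Derived::foo under the assumption that it does not call bar: d = c*c;
Task derived_foo() { return Task{{"c"}, {"d"}}; }
```

With these sets, bernstein_ok(base_foo(), base_bar()) is false because foo writes a while bar reads it, which is the dependence that the send and receive pair of Figure 4.2 resolves. bernstein_ok(derived_foo(), base_bar()) is also false because bar writes c while Derived::foo reads it, and nothing in the inherited parallel block orders the two accesses.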



5. A Framework for Specifying Parallelism 

In the previous section we established that it is extremely difficult to specify
parallelism and synchronization constraints elegantly in a COOPL. Apparently,
the inheritance anomaly and the dubious reuse of sequential classes make
COOPLs a less attractive alternative for parallel programming. In the past,
several researchers have designed COOPLs in an attempt to solve these problems,
but they have been only partially successful. In this section, we propose
our solution for these problems in a class based COOPL.

We have designed a new COOPL, called CORE [21], which is based on
C++ [11]. In CORE, the parallelism and all the necessary synchronization
constraints for the methods of a class are specified in an abstract class
(AC) [6,11] associated with the class. Consequently, a subclass of a superclass
is able to either: (i) bypass the synchronization code which would otherwise
be embedded in the inherited methods, or (ii) inherit, override, customize,
and redefine the synchronization code of the inherited methods in an AC
associated with the subclass. In CORE, we are able to break SPA, SMA,
Anomaly A, and Anomaly B. However, we are not completely able to avoid
HSASA, but we minimize the resulting code redefinitions. Besides, the sequential
classes are reusable in a CORE program. The CORE framework for
parallel programming is attractive for the following reasons:

synchronization constraints are specified separately from the method code;

inheritance hierarchies of the sequential and concurrent components are
maintained orthogonally;

the degrees of reuse of the sequential and synchronization code are higher;
and

parallelism granularity can be more easily controlled.



6. Previous Approaches 

In ACT++ [18], concurrent objects are viewed and designed as sets of states.
An object can only be in one state at a time and
methods transform its state. States are inherited and/or may be redefined in
a subclass without a method ever needing a redefinition; however, sequential
components cannot be reused, and the become expression does not allow the
call/return mechanism of C++. Their proposal suffers from SPA.

In Rosette [32], enabled sets are used to define messages that are allowed
in the object's next state. The enabled sets are also objects and their method
invocations combine the sets from the previous states. The authors suggest
making the enabled sets first-class values. However, their approach is extremely
complex for specification of concurrency. Additionally, this solution,
too, is inadequate to solve SPA and HSASA.

The authors of Hybrid [28] associate delay queues with every method. 
The messages are accepted only if the delay queues are empty. The methods 
may open or close other queues. The problem with the delay queue approach 
is that the inheritance and queue management are not orthogonal. Their 
approach is vulnerable to Anomaly B and HSASA. Besides, the sequential 
components cannot be reused. 

Eiffel [7] and the family of POOL languages [2] advocate the use of centralized
classes to control concurrent computations. The sequential components
can be reused; however, only one method can execute at a time, and the live
method must be reprogrammed every time a subclass with a new method is
added. The designers of POOL-T [2] disallow inheritance. Both these schemes
also suffer from Anomaly A.

Guide [10] uses activation conditions, or synchronization counters, to specify
an object's state for executing a method. These activation conditions are
complex expressions involving the number of messages received, completed,
executing, and message contents, etc. Clearly, such a specification directly
conflicts with inheritance. Besides, a derived class method can potentially
invalidate the synchronization constraints of the base class method, and hence,
faces Anomaly B.

Saleh et al. [31] have attempted to circumvent the two problems but they
restrict the speci9cation of concurrency to intra-object level. They use condi- 
tional waits for synchronization purposes. There are no multiple mechanisms 
for specifying computational granularity and the reusability of sequential 
classes is impossible. 

Meseguer [26] has suggested the use of order-sorted rewriting logic and
declarative solutions, where no synchronization code is ever used, for avoiding
the inheritance anomaly. However, it is unclear how the proposed
solutions could be adapted to a more practical setting, as in a class based
COOPL.

In Concurrent C++ [14], the sequential classes are reusable and
SPA and SMA do not occur; however, HSASA and Anomaly B remain unsolved.
Similarly, in CC++ [8], SPA, HSASA, and Anomaly B can occur and
the sequential class reusability is doubtful.

Much like us, Matsuoka et al. [24], too, have independently emphasized
the localization and orthogonality of synchronization schemes for solving
the problems associated with COOPLs. They have suggested an elegant
scheme similar to that of path expressions [3] for solving these problems for
an actor based concurrent language, called ABCL. In their scheme every possible
state transition for an object is specified in the class. However, with
their strategy the reuse of sequential components is improbable.



7. The Concurrency Abstraction Model 

Recall from the previous section that the occurrence of the inheritance
anomaly and the dubious reuse of sequential classes in a COOPL are primarily
due to: (i) the synchronization constraints being an integral part of
the method bodies, and (ii) the inheritance of the sequential and concurrent
parts in a method code being non-orthogonal. We propose a novel notion
of concurrency abstraction as the model for parallel programming in CORE,
where these two factors are filtered out, and consequently, the two problems
associated with a COOPL are solved. We first define the following two terms
before we explain the meaning of concurrency abstraction.

Definition 7.1. A concurrent region is that piece of method code³ (or
a thread of control), which must be protected using a synchronization constraint.

Definition 7.2. An AC is an abstract class⁴ associated with a class, C,
where the parallelism and the necessary synchronization constraints for the
concurrent regions of C are specified.

The foundation of the concurrency abstraction model is built on the identification
of concurrent regions and definitions of ACs: the sequential code
of a concurrent region is customized to a concurrent code (i.e., a piece of
code which has the specified parallelism and synchronization) by using the
specifications in an AC. In a sense, a class, C, inherits some attributes
(specification of parallelism) for its methods from its AC such that the subclasses
of C cannot implicitly inherit them. For a subclass to inherit these
attributes, it must explicitly do so by defining its AC as a subclass of the
AC associated with its superclass.

Thus, a CORE programmer builds three separate and independent inheritance
hierarchies: first, for the sequential classes; second, for a class and
its AC; and third, for the ACs, if required. A hierarchy of ACs keeps the
inheritance of the synchronization code orthogonal to the inheritance of the
sequential methods. Such a dichotomy helps a subclass: (a) to bypass the
synchronization specific code which would otherwise be embedded in a base
class method, and (b) to inherit, override, customize, and redefine the synchronization
code of the inherited methods in its own AC.

We should point out that any specification inside an AC is treated as
a compiler directive and no processes get created. These specifications are
inlined into the method codes by a preprocessor.

We shall now illustrate the concurrency abstraction model with an example.
Consider a class, B, with two methods, b1 and b2, which are to be
concurrently executed for an object of this class. In other words, the method
bodies of b1 and b2 are the concurrent regions. In order to specify concurrent
execution of these methods, one creates an AC of B, say Syn_B. In Syn_B, the
two concurrent regions, b1 and b2, are enclosed inside a parallel block as
shown in Figure 7.1(a). Let us consider the following three scenarios, where D is
defined as a subclass of B.

Case 1 : Assume that D inherits b1 but overrides b2 by a new definition. If
it is incorrect to concurrently invoke b1 and b2 for objects of D, then an
AC for D is not defined (see Figure 7.1(b)). Otherwise an AC, Syn_D, for
D is defined. Two possibilities emerge depending upon whether or not a
new concurrency specification is needed, i.e., either (i) a new specification
is needed, and hence, a new parallel block enclosing b1 and b2 is defined in
Syn_D (see Figure 7.1(c)); or, (ii) the old specification is reused, and thus,
Syn_D simply inherits from Syn_B (see Figure 7.1(d)). Notice that when
Syn_D neither defines nor inherits from Syn_B, the specified parallelism
and synchronization in Syn_B are effectively bypassed by the inherited
method, b1, in D.

³ Note that an entire method could be identified as a concurrent region.

⁴ Conventionally, an abstract class denotes that class in an OO program for which
no object can be instantiated [6, 11]. We have followed the same convention in
CORE.

Fig. 7.1. Examples to illustrate the concurrency abstraction model.

Case 2 : Assume that D inherits b1 and b2, and defines a new method, d1.
Much like the previous case, a new AC, Syn_D, for D may or may not be
needed. Assume that Syn_D is needed. In case b1, b2 and d1 require
a new concurrency specification, a new parallel block in Syn_D is defined,
which does not inherit from Syn_B (see Figure 7.1(e)). However, if the old
specification for b1 and b2 is reused but with a new specification for d1,
then a parallel block enclosing d1 is defined in Syn_D. Moreover, Syn_D
inherits from Syn_B (see Figure 7.1(f)).

Case 3 : Assume that D inherits b1 and b2, and defines a new method, d1.
Unlike the previous two cases, assume that one specifies the concurrency
and synchronization at the inter-class level rather than at the intra-class
level. Consider a situation where a function (or a method), foo, defines
a parallel block enclosing objectB.b1() and objectD.d1() as two parallel
processes. If these two processes communicate, then that must be
specified in an AC, Syn_B_D, associated with both B and D.
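
The separation described in Cases 1 and 2 can be approximated in standard C++, although CORE itself relies on a preprocessor and on the constructs introduced in the next section. In the sketch below, which rests on that assumption, the sequential classes contain no synchronization code and a separate class plays the role of the AC; the method name run, the members x and y, and the class E are hypothetical.

```cpp
#include <cassert>
#include <thread>

// Sequential class: the method bodies contain no synchronization code.
class B {
public:
    int x = 0, y = 0;
    virtual ~B() = default;
    virtual void b1() { x = 1; }
    virtual void b2() { y = 2; }
};

// Stand-in for the AC of B: only the parallelism specification lives here.
class Syn_B {
public:
    virtual ~Syn_B() = default;
    virtual void run(B& obj) const {
        std::thread t1([&obj] { obj.b1(); });    // start of the parallel block
        std::thread t2([&obj] { obj.b2(); });
        t1.join();                               // implicit synchronization at
        t2.join();                               // the end of the block
    }
};

// D overrides b2; b1 and b2 are still independent, so the old specification
// can be reused by letting Syn_D inherit from Syn_B, as in Case 1(ii).
class D : public B {
public:
    void b2() override { y = 20; }
};
class Syn_D : public Syn_B {};

// E overrides b2 so that it depends on b1. Running them concurrently would be
// incorrect, so no AC is defined and the parallelism of Syn_B is bypassed.
class E : public B {
public:
    void b2() override { y = x + 10; }
};
```

Neither D nor E had to touch the body of the inherited b1, and the sequential class hierarchy (B, D, E) evolves independently of the hierarchy of synchronization classes (Syn_B, Syn_D).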



8. The CORE Language 

Following our discussions on issues in designing a COOPL, in CORE we
support: (i) passive objects with an assumption that an appropriate process
management scheme is available, (ii) schemes for specification of intra- and
inter-object parallelism, (iii) multiple synchronization schemes, and (iv) the
inheritance model of C++. Our primary goal is to avoid the inheritance
anomaly and to allow the reuse of sequential classes. In CORE, the language
extensions are based on the concurrency abstraction model as described in
the previous section. We shall present the syntax of new constructs in CORE
using the BNF grammar and their semantics informally with example programs.

8.1 Specifying a Concurrent Region 

As defined earlier, a concurrent region is that piece of code which is protected
using a synchronization constraint. Some examples of a concurrent region
include a critical section of a process, a process awaiting the completion of
inter-process communication (such as a blocking send or a receive), a process
protected under a guard (as in Figure 3.3), etc. In CORE, a concurrent region
can be specified at the intra- and inter-class levels by using the reserved words,
Intra_Conc_Reg and Inter_Conc_Reg, respectively. These reserved words are
placed inside a class in much the same way as the access privilege specifiers
(public, private, protected) are placed in a C++ program. The concurrent
regions do not explicitly encode the synchronization primitives inside the
class, but they are redefined inside their associated ACs.



8.2 Defining an AC

As mentioned earlier, an AC is associated with a class for whose concurrent
regions the parallelism and the necessary synchronization constraints are to
be specified. In other words, a concurrent region of a class is redefined in an
AC such that the necessary synchronization scheme is encoded in it. Note
that the members of an AC can access all the accessible members of the class
it is associated with. An AC can be specified at the intra-class and inter-class
levels using the reserved words, Intra_Sync and Inter_Sync, respectively.
The BNF grammar for specifying an AC is given in Figure 8.1.



    <ACspec>      ::= <ACtag> <ACname> : <InheritList> { <ACdef> } ;
    <InheritList> ::= <classname> <InheritList> | <ACname> <InheritList>
    <ACtag>       ::= Intra_Sync | Inter_Sync
    <ACname>      ::= <identifier>

Fig. 8.1. The BNF grammar for specifying an AC.



8.3 Defining a Parallel Block

A parallel block [8, 17] encloses a list of concurrent or parallel processes. These 
processes become active once the control enters the parallel block, and they 
synchronize at the end of the block, i.e., all the processes must terminate 
before the 9rst statement after the end of block can be executed. The reserved 
words, Parbegin and Parend, mark the beginning and the end of a parallel 
block, respectively. The BNF grammar for specifying a parallel block in a 
CORE program is shown in Figure 8.2. 



⟨ParProc⟩  ::= Parbegin [⟨methodname⟩] [⟨LoopStmt⟩] ⟨ProcList⟩ Parend ;
⟨LoopStmt⟩ ::= for ( ⟨identifier⟩ = ⟨initExp⟩ ; ⟨lastExp⟩ ; ⟨incExp⟩ )
⟨ProcList⟩ ::= ⟨proc⟩ ; ⟨ProcList⟩ | ⟨proc⟩
⟨proc⟩     ::= ⟨functionCall⟩ | ⟨methodCall⟩



Fig. 8.2. The BNF grammar for specifying a parallel block. 



A ⟨ProcList⟩ lists the parallel processes, where each process in the list is
either a function call or a method invocation. A loop may be associated with
a parallel block by using the for loop syntax of C++ [11]. In the loop
version of a parallel block, all the loop iterations are active
simultaneously. In CORE, another kind of parallel block can be specified by
placing a ⟨methodname⟩ at the beginning of the block. However, such a
specification can only be enforced inside an intra-class AC. With such a
specification, once the process ⟨methodname⟩ completes its execution, all the
processes in the block are spawned as child processes of this process. Note
that a similar specification is possible in Mentat [15]. We now describe how
intra- and inter-object concurrency can be specified in CORE.

8.3.1 Intra-object Concurrency. Inside a parallel block, method invocations
on the same object can be specified for parallel execution. Consider a class,
Foo, and its associated AC, Syn_Foo, as shown in Figure 8.3. Foo has a public
method, foobar, and two private methods, bar and baz. In Syn_Foo, a parallel
block is defined which specifies that whenever an object of Foo invokes
foobar, two concurrent processes corresponding to bar and baz must be spawned.
The parent process, foobar, terminates only after the two children complete
their execution.



class Foo {
private:
    bar() { ... }
    baz() { ... }
public:
    foobar();
};

Intra_Sync Syn_Foo: Foo {
    Parbegin foobar()
        bar()
        baz()
    Parend;
};



Fig. 8.3. A class Foo and its intra-class AC. 



8.3.2 Inter-object Concurrency. Inside a parallel block, method invocations
on two objects (of the same or different classes) can be specified for
parallel execution. Consider Figure 8.4, where the function, main, creates
two objects, master and worker. In main, a parallel block is specified where
each object concurrently receives a message, foobar. Earlier, in Figure 8.3,
we had specified that the methods, bar and baz, be concurrently invoked on an
object executing foobar. Consequently, main spawns two processes,
master.foobar() and worker.foobar(), and both these processes further fork
two processes each.



int main() {
    Foo master, worker;
    Parbegin
        master.foobar();
        worker.foobar();
    Parend;
}



Fig. 8.4. A program to show specification of inter-object parallelism.
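
The fork and join behaviour described for Figures 8.3 and 8.4 can be
approximated in standard C++ with std::thread. The following is only a
sketch under stated assumptions: it is not CORE syntax, the counters bar_runs
and baz_runs and the function run_parallel_block are hypothetical names added
for illustration, and each join() plays the role of the implicit
synchronization at Parend.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the class Foo of Figure 8.3.
class Foo {
public:
    std::atomic<int> bar_runs{0};
    std::atomic<int> baz_runs{0};

    // foobar acts as the parent process: it forks bar and baz and
    // terminates only after both children have completed.
    void foobar() {
        std::thread t_bar([this] { bar(); });
        std::thread t_baz([this] { baz(); });
        t_bar.join();  // synchronization point corresponding to Parend
        t_baz.join();
    }

private:
    void bar() { ++bar_runs; }
    void baz() { ++baz_runs; }
};

// Analogue of the parallel block in main (Figure 8.4): both objects receive
// foobar concurrently, and control leaves the block only after both finish.
int run_parallel_block() {
    Foo master, worker;
    std::thread t_master([&master] { master.foobar(); });
    std::thread t_worker([&worker] { worker.foobar(); });
    t_master.join();
    t_worker.join();
    return master.bar_runs.load() + master.baz_runs.load() +
           worker.bar_runs.load() + worker.baz_runs.load();
}
```

In this sketch four child threads run in total, two per object, which matches
the statement that both processes spawned by main fork two processes each.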
8.4 Synchronization Schemes

The concurrent processes in CORE can interact and synchronize using different
schemes, namely: (i) a mailbox, (ii) a guard, (iii) a pair of blocking send
and receive primitives, and (iv) a predefined Lock class which implements a
binary semaphore such that the P and V operations (or methods) can be invoked
on an object of this class.

Consider Figure 8.5, where a class, A, is defined with two methods, m1 and m2
(method1 and method2 in the figure). These methods have been identified as
two concurrent regions using the tag, Intra_Conc_Reg. We have defined two
different intra-class ACs, Syn_A and Syn_A1, for illustrating the use of
different synchronization schemes. Note that in Syn_A, Send and Receive
primitives have been used for synchronization, while a semaphore, sem, has
been used in Syn_A1 for the same purpose.



class A {
Intra_Conc_Reg:
    void method1() { ... }
    void method2() { ... }
};

Intra_Sync Syn_A: A {
    Comm_Buffer buf;
    void method1() {
        method1();
        Send(&buf, writeVar);
    }
    void method2() {
        Receive(&buf, readVar);
        method2();
    }
};

Intra_Sync Syn_A1: A {
    Lock sem;
    void method1() {
        method1();
        sem.V();
    }
    void method2() {
        sem.P();
        method2();
    }
};



Fig. 8.5. A program to illustrate the use of different synchronization schemes
inside ACs.
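
The text describes Lock only as a predefined class implementing a binary
semaphore with P and V operations. One possible way to realize such a class
in standard C++ is sketched below with a mutex and a condition variable. The
constructor argument, the initial state, and the helper run_ordered are
assumptions made for illustration; they are not part of CORE.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical binary semaphore; the initial value is an assumption.
class Lock {
public:
    explicit Lock(bool available = false) : available_(available) {}

    // P: block until the semaphore is available, then take it.
    void P() {
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return available_; });
        available_ = false;
    }

    // V: release the semaphore and wake one waiting thread.
    void V() {
        {
            std::lock_guard<std::mutex> guard(m_);
            available_ = true;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool available_;
};

// Mirrors Syn_A1: method1 runs and then signals; method2 waits, then runs.
std::vector<int> run_ordered() {
    Lock sem;                // starts unavailable, so the second thread waits
    std::vector<int> trace;  // only touched in the order enforced by sem
    std::thread second([&] { sem.P(); trace.push_back(2); });
    std::thread first([&] { trace.push_back(1); sem.V(); });
    first.join();
    second.join();
    return trace;
}
```

Whichever thread is scheduled first, the body standing in for method2 can run
only after the body standing in for method1 has called V.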



9. Illustrations 

In this section we shall illustrate how the concurrency abstraction model of
CORE can effectively support the reuse of a sequential class method in a
subclass and avoid the inheritance anomaly without ever needing any
redefinitions.

9.1 Reusability of Sequential Classes 

Consider Figure 9.1, where a class, Queue, along with its two methods, add
and del, is defined. Assume that inside a parallel block add and del are
concurrently invoked on an object of Queue.



class Queue {
public:
    add(int newitem) {
        temp = ...;
        temp->item = newitem;
        temp->next = front_ptr;
        front_ptr = temp;
    }
    del() {
        val = front_ptr->item;
        front_ptr = front_ptr->next;
    }
};



Fig. 9.1. A class definition for Queue.



Since a shared variable, front_ptr, is modified by add and del, it must be
protected against simultaneous modifications from these two parallel
processes. Let us identify and define two concurrent regions, R1 and R2,
corresponding to the code segments accessing front_ptr in add and del,
respectively. In other words, the code segments in R1 and R2 correspond to
the critical sections of add and del. The transformed definition of Queue,
and its associated AC, Syn_Queue, are shown in Figure 9.2. Note that if we
inline R1 and R2 at their respective call-sites in add and del, we get the
original (sequential) versions of the add and del methods, as in Figure 9.1.
R1 and R2 are redefined in Syn_Queue by enclosing them within a pair of P and
V operations on a semaphore, sem. These redefined concurrent regions are
inlined into add and del while generating their code. Similarly, an object,
sem, is defined as an instance variable in Queue.

Assume that a subclass, Symbol_Table, is defined for Queue, as shown in
Figure 9.3. Symbol_Table reuses add, overrides del, and defines a new method,
search.

class Queue {
Intra_Conc_Reg:
    R1(SomeType* ptr) {
        ptr->next = front_ptr;
        front_ptr = ptr;
    }
    R2() {
        val = front_ptr->item;
        front_ptr = front_ptr->next;
    }
public:
    add(int newitem) {
        temp = ...;
        temp->item = newitem;
        R1(temp);
    }
    del() { R2(); }
};

Intra_Sync Syn_Queue: Queue {
    Lock sem;
    void R1() {
        sem.P();
        R1();
        sem.V();
    }
    void R2() {
        sem.P();
        R2();
        sem.V();
    }
};



Fig. 9.2. A complete definition for the Queue class.



While compiling the code for add in Symbol_Table, the untransformed inlined
code for R1 is used, and hence, its definition remains the same as in
Figure 9.1, as desired.
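
The separation between the sequential body of a region and its
synchronization wrapper can be imitated in standard C++ with a policy
template. This is a different technique from the preprocessor-based inlining
used by CORE, and it is offered only as a rough analogue: NoSync, SemSync,
Node and demo are hypothetical names, and a std::mutex stands in for the Lock
semaphore.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical policies standing in for "no AC" and for Syn_Queue.
struct NoSync {
    void P() {}
    void V() {}
};

struct SemSync {
    std::mutex sem;
    void P() { sem.lock(); }
    void V() { sem.unlock(); }
};

struct Node {
    int item;
    Node* next;
};

template <class Sync>
class Queue {
public:
    ~Queue() {
        while (front_ptr_ != nullptr) {
            Node* dead = front_ptr_;
            front_ptr_ = dead->next;
            delete dead;
        }
    }

    void add(int newitem) {
        Node* temp = new Node{newitem, nullptr};
        sync_.P();                // a no-op for NoSync
        temp->next = front_ptr_;  // body of region R1
        front_ptr_ = temp;
        sync_.V();
    }

    // Precondition (assumed here): the queue is not empty.
    int del() {
        sync_.P();
        Node* old = front_ptr_;   // body of region R2
        int val = old->item;
        front_ptr_ = old->next;
        sync_.V();
        delete old;
        return val;
    }

private:
    Node* front_ptr_ = nullptr;
    Sync sync_;
};

// The same method bodies serve both the sequential and the locked variant.
int demo() {
    Queue<NoSync> sequential;
    sequential.add(1);
    sequential.add(2);
    Queue<SemSync> concurrent;
    concurrent.add(3);
    concurrent.add(4);
    return sequential.del() * 10 + concurrent.del();
}
```

With NoSync the compiler sees the plain sequential code of Figure 9.1, while
SemSync yields the P and V bracketing of Figure 9.2; the bodies of add and
del are written once in both cases.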



class Symbol_Table : public Queue {
public:
    del() { ... }
    search() { ... }
};



Fig. 9.3. A class definition for Symbol_Table.



9.2 Avoiding the Inheritance Anomaly 

Anomaly A cannot occur in CORE because the notion of centralized classes (as
proposed in [7]) does not exist. Furthermore, since the declaration and use
of different synchronization primitives in a CORE program are restricted to
an AC, the risk of their being operated upon by subclass methods is
eliminated. Consequently, Anomaly B is also avoided.

We now illustrate with an example how SPA can be avoided in a CORE program.
The use of guards has been proposed by several researchers for avoiding SPA;
we, too, advocate their use. However, while defining a subclass, a CORE
programmer is never in a dilemma simply because some other form of
synchronization scheme has been previously committed to in the base class. In
contrast with the other COOPL approaches, where guards can only be associated
with methods, in CORE they can be associated with concurrent regions.
Consequently, more computations can be interleaved in CORE, resulting in a
more fine-grain specification of parallelism.



class B_Buffer {
    int in, out, n, max;
    char *buf;
public:
    B_Buffer(int size) {
        max = size;
        buf = new char[max];
        in = out = n = 0;
    }
    void Put(char c) { R1(c); }
    int Get(void) {
        char c;
        R2(&c);
        return c;
    }
Intra_Conc_Reg:
    void R1(char c) {
        buf[in] = c;
        in = (in+1) % max;
        n++;
    }
    void R2(char *c) {
        *c = buf[out];
        out = (out+1) % max;
        n--;
    }
};

Intra_Sync Syn_B1: B_Buffer {
    Lock empty, full;
    R1() {
        empty.P();
        R1();
        full.V();
    }
    R2() {
        full.P();
        R2();
        empty.V();
    }
};



Fig. 9.4. A class definition for B_Buffer in CORE.



Let us reconsider the class definition for the bounded buffer problem as
shown in Figure 3.1. The CORE classes for B_Buffer and B_Buffer2 are shown in
Figure 9.4 and Figure 9.5, respectively. The intra-class ACs, Syn_B1 and
Syn_B2, associated with B_Buffer and B_Buffer2, respectively, are also shown
in these figures. Note the use of different synchronization schemes in the
three concurrent regions: R1 and R2 use semaphores for synchronization, and
R3 uses a guard.



class B_Buffer2: public B_Buffer {
public:
    char GetMorethanOne(int howmany) {
        char last_char;
        R3(&last_char, howmany);
        return last_char;
    }
Intra_Conc_Reg:
    void R3(char *c, int n) {
        for (int i=0; i < n; i++)
            *c = Get();
    }
};

Intra_Sync Syn_B2: Syn_B1, B_Buffer2 {
    R3() {
        if (in >= out + n)
            R3();
    }
};



Fig. 9.5. A class definition for B_Buffer2 in CORE.

In CORE, we are not completely successful in avoiding HSASA; however, we
minimize the resulting code redefinitions, as we illustrate below. Let us
reconsider the class, B_Buffer3, in Figure 3.4 from Section 3.2. The CORE
class for B_Buffer3, along with its associated intra-class AC, Syn_B3, is
shown in Figure 9.6. Note that the boolean, after_put, must be set to true
and false in the inherited methods, Put and Get, respectively. In addition to
inheriting Syn_B1 in Syn_B3, we redefine the synchronization code as shown in
Figure 9.6. We are thus able to minimize the code redefinition as claimed
earlier.



class B_Buffer3: public B_Buffer {
public:
    char GetAfterPut() {
        R4();
    }
Intra_Conc_Reg:
    R4() { ... }
};

Intra_Sync Syn_B3: Syn_B1, B_Buffer3 {
    int after_put = 0;
    R1() {
        R1();
        after_put = 1;
    }
    R2() {
        R2();
        after_put = 0;
    }
    R4() {
        if (!after_put && in >= out + 1)
            R4();
    }
};



Fig. 9.6. A class definition for B_Buffer3 in CORE.



10. The Implementation Approach

We have proposed CORE as a framework for developing concurrent OO programs
such that different kinds of inheritance anomalies do not occur and the
sequential classes remain highly reusable. The code redefinitions in CORE are
effectively and easily handled by a preprocessor, which customizes the method
code for each class in a bottom-up manner. An obvious problem with such
static inlining is code duplication and an overall increase in code size. One
can avoid such code duplication with a scheme similar to that of manipulating
virtual tables in C++ [11]. Naturally, such a solution is more complex, and
the overhead of retrieving the exact call through a chain of pointers at run
time makes this approach inefficient. Since concurrent programs are targeted
for execution on multiple processors and networks of workstations, where the
cost of code migration is extremely high, our choice of code duplication is
well justified. However, the compile-time expansion approach cannot handle
previously compiled class libraries; they must be recompiled.



11. Conclusions and Future Directions 

We have proposed a framework for specifying parallelism in COOPLs, and have
demonstrated that: (a) the inheritance anomaly can be avoided, and (b) the
sequential classes can be effectively reused. We have introduced a novel
model of concurrency abstraction in CORE, which is the key to solving two
important problems associated with COOPLs. In the proposed model (a) the
specification of the synchronization code is kept separate from the method
bodies, and (b) the sequential and the concurrent parts in the method bodies
of a superclass are inherited by the subclasses in an orthogonal manner. In
CORE, we avoid the state partitioning anomaly (SPA), the state modification
anomaly (SMA), and Anomaly B. We disallowed the notion of a centralized
class, and hence, Anomaly A can never be encountered in CORE. However, the
history sensitiveness of acceptable states anomaly (HSASA) can still occur in
CORE, but with minimal code redefinitions for the inherited methods.

We have also established that there is no need for a COOPL designer to commit
to just one kind of synchronization scheme: multiple synchronization schemes
may be allowed in a COOPL. Finally, intra- and inter-object parallelism can
be easily accommodated in a COOPL, as in CORE. As the proposed concurrency
abstraction model is language independent, it can be easily adapted into
other class-based COOPLs.

In the CORE framework we have not discussed the issues pertaining to task
partitioning and scheduling, load-balancing, and naming and retrieval of
remote objects. These issues can potentially expose several interesting
research problems. Moreover, the model in CORE itself could use work in the
direction of avoiding HSASA in COOPLs. While the data-parallel model of
parallel programming has not been the focus of this work, the integration of
inheritance and the specification of data partitions may exhibit an anomaly
similar to the inheritance anomaly.

Summary. This chapter is devoted to a comparative survey of loop parallelization
algorithms. Various algorithms have been presented in the literature, such as those 
introduced by Allen and Kennedy, Wolf and Lam, Darte and Vivien, and Feautrier. 
These algorithms make use of different mathematical tools. Also, they do not rely 
on the same representation of data dependences. In this chapter, we survey each of 
these algorithms, and we assess their power and limitations, both through examples 
and by stating “optimality” results. An important contribution of this chapter is 
to characterize which algorithm is the most suitable for a given representation 
of dependences. This result is of practical interest, as it provides guidance for a 
compiler-parallelizer: given the dependence analysis that is available, the simplest 
and cheapest parallelization algorithm that remains optimal should be selected. 



1. Introduction 

Loop parallelization algorithms are useful source-to-source program
transformations. They are particularly appealing as they can be applied
without any knowledge of the target architecture. They can be viewed as a
first machine-independent step in the code generation process. Loop
parallelization will detect parallelism (transforming DO loops into DOALL
loops) and will expose those dependences that are responsible for the
intrinsic sequentiality of some operations in the original program.

Of course, a second step in code generation will have to take machine
parameters into account. Determining a good granularity generally is a key to
efficient performance. Also, data distribution and communication optimization
are important issues to be considered. But all these problems will be
addressed at a later stage. Such a two-step approach is typical in the field
of parallelizing compilers (other examples are general task graph scheduling
and software pipelining).

This chapter is devoted to the study of various parallelism detection
algorithms based on:

1. A simple decomposition of the dependence graph into its strongly
   connected components, such as Allen and Kennedy's algorithm [2].

2. Unimodular loop transformations, either ad-hoc transformations such as
   Banerjee's algorithm [3], or generated automatically such as Wolf and
   Lam's algorithm [31].

3. Schedules, either mono-dimensional schedules [10, 12, 19] (a particular
   case being the hyperplane method [26]) or multi-dimensional
   schedules [15, 20].

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 141-171, 2001. 

These loop parallelization algorithms are very different for a number of
reasons. First, they make use of various mathematical techniques: graph
algorithms for (1), matrix computations for (2), and linear programming for
(3). Second, they take a different description of data dependences as input:
graph description and dependence levels for (1), direction vectors for (2),
and description of dependences by polyhedra or affine expressions for (3).
For each of these algorithms, we identify the key concepts that underlie
them, and we discuss their respective power and limitations, both through
examples and by stating "optimality" results.

An important contribution of this chapter is to characterize which algorithm
is the most suitable for a given representation of dependences. There is no
need to use a sophisticated dependence analysis algorithm if the
parallelization algorithm cannot take advantage of the precision of its
result. Conversely, there is no need to use a sophisticated parallelization
algorithm if the dependence representation is not precise enough.

The rest of this chapter is organized as follows. Section 2 is devoted to a
brief summary of what loop parallelization algorithms are all about. In
Section 3, we review major dependence abstractions: dependence levels,
direction vectors, and dependence polyhedra. Allen and Kennedy's algorithm
[2] is presented in Section 4 and Wolf and Lam's algorithm [31] is presented
in Section 5. It is shown that both algorithms are "optimal" in the class of
those parallelization algorithms that use the same dependence abstraction as
their input, i.e. dependence levels for Allen and Kennedy and direction
vectors for Wolf and Lam. In Section 6 we move to a new algorithm that
subsumes both previous algorithms. This algorithm is based on a
generalization of direction vectors, the dependence polyhedra. In Section 7
we briefly survey Feautrier's algorithm, which relies on exact affine
dependences. Finally, we state some conclusions in Section 8.



2. Input and Output of Parallelization Algorithms 

Nested DO loops make it possible to describe a set of computations whose size
is much larger than the corresponding program size. For example, consider n
nested loops whose loop counters describe an n-cube of size N: these loops
encapsulate a set of computations of size $N^n$. Furthermore, it often
happens that such loop nests contain a non-trivial degree of parallelism,
i.e. a set of independent computations of size $\Omega(N^r)$ for $r \ge 1$.

This makes the parallelization of nested loops a very challenging problem: a
compiler-parallelizer must be able to detect, if possible, a non-trivial
degree of parallelism with a compilation time not proportional to the
sequential execution time of the loops. To make this possible, efficient
parallelization algorithms must be proposed with a complexity, an input size
and an output size that depend only on n but certainly not on N, i.e. that
depend on the size of the sequential code but not on the number of
computations described.

The input of parallelization algorithms is a description of the dependences
which link the different computations. The output is a description of an
equivalent code with explicit parallelism.

2.1 Input: Dependence Graph 

Each statement of the loop nest is surrounded by several loops. Each
iteration of these loops defines a particular execution of the statement,
called an operation. The dependences between the operations are represented
by a directed acyclic graph: the expanded dependence graph (EDG). There are
as many vertices in the EDG as operations in the loop nest. Executing the
operations of the loop nest while respecting the partial order specified by
the EDG guarantees that the correct result of the loop nest is preserved.
Detecting parallelism in the loop nest amounts to detecting anti-chains in
the EDG. We illustrate the notion of "expanded dependence graph" with
Example 2.1 below. The EDG corresponding to this code is depicted in
Figure 2.1.



Example 2.1.

DO i=1,n
  DO j=1,n
    a(i,j) = a(i-1,j-1) + a(i,j-1)
  ENDDO
ENDDO



Fig. 2.1. Example 2.1 and its EDG.

Unfortunately, the EDG cannot be used as input for parallelization
algorithms, since it is usually too large and may not be described exactly at
compile-time. Therefore the reduced dependence graph (RDG) is used instead.
The RDG is a condensed and approximated representation of the EDG. This
approximation must be a superset of the EDG, in order to preserve the
dependence relations. The RDG has one vertex per statement in the loop nest
and its edges are labeled according to the chosen approximation of
dependences (see Section 3 for details). Figure 2.2 presents two possible
RDGs for Example 2.1, corresponding to two different approximations of the
dependences.

Fig. 2.2. RDG: a) with dependence levels; b) with direction vectors.

Since its input is an RDG and not an EDG, a parallelization algorithm is not
able to distinguish between two different EDGs which have the same RDG.
Hence, the parallelism that can be detected is the parallelism contained in
the RDG. Thus, the quality of a parallelization algorithm must be studied
with respect to the dependence analysis.

For example, Example 2.1 and Example 2.2 have the same RDG with dependence
levels (Figure 2.2 (a)). Thus, a parallelization algorithm which takes as
input RDGs with dependence levels cannot distinguish between the two codes.
However, Example 2.1 contains one degree of parallelism whereas Example 2.2
is intrinsically sequential.

Example 2.2.

DO i=1,n
  DO j=1,n
    a(i,j) = 1 + a(i-1,n) + a(i,j-1)
  ENDDO
ENDDO


2.2 Output: Nested Loops 

The size of the parallelized code, as noticed before, should not depend on
the number of operations that are described. This is the reason why the
output of a parallelization algorithm must always be described by a set of
loops ¹.

There are at least three ways to define a new order on the operations of a
given loop nest (i.e. three ways to define the output of the parallelization
algorithm), in terms of nested loops:

1. Use elementary loop transformations as basic steps for the algorithm,
   such as loop distribution (as in Allen and Kennedy's algorithm), or loop
   interchange and loop skewing (as in Banerjee's algorithm).

2. Apply a linear change of basis on the iteration domain, i.e. apply a
   unimodular transformation on the iteration vectors (as in Wolf and Lam's
   algorithm).

3. Define a d-dimensional schedule, i.e. apply an affine transformation from
   $\mathbb{Z}^n$ to $\mathbb{Z}^d$ and interpret the transformation as a
   multi-dimensional timing function. Each component will correspond to a
   sequential loop, and the missing (n - d) dimensions will correspond to
   DOALL loops (as in Feautrier's algorithm and Darte and Vivien's
   algorithm).

¹ These loops can be arbitrarily complicated, as long as their complexity only
depends on the size of the initial code. Obviously, the simpler the result, the
better. But, in this context, the meaning of "simple" is not clear: it depends
on the optimizations that may follow. We consider that structural simplicity is
preferable, but this can be discussed.

The output of these three transformation schemes can indeed be described as
loop nests, after a more or less complicated rewriting process (see
[8, 9, 11, 31, 36]). We do not discuss the rewriting process here. Rather, we
focus on the link between the representation of dependences (the input) and
the loop transformations involved in the parallelization algorithm (the
output). Our goal is to characterize which algorithm is optimal for a given
representation of dependences. Here, "optimal" means that the algorithm
succeeds in exhibiting the maximal number of parallel loops.



3. Dependence Abstractions 

For the sake of clarity, we restrict ourselves to the case of perfectly
nested DO loops with affine loop bounds. This restriction makes it possible
to identify the iterations of the n nested loops (n is called the depth of
the loop nest) with vectors in $\mathbb{Z}^n$ (called the iteration vectors)
contained in a finite convex polyhedron (called the iteration domain)
bounded by the loop bounds. The i-th component of an iteration vector is the
value of the i-th loop counter in the nest, counting from the outermost to
the innermost loop. In the sequential code, the iterations are therefore
executed in the lexicographic order of their iteration vectors.

In the next sections, we denote by $\mathcal{D}$ the polyhedral iteration
domain, by I and J n-dimensional iteration vectors in $\mathcal{D}$, and by
$S_i$ the i-th statement in the loop nest, where $1 \le i \le s$. We write
$I >_l J$ if I is lexicographically greater than J, and $I \ge_l J$ if
$I >_l J$ or $I = J$.
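
The lexicographic order on iteration vectors can be written down directly.
The following C++ sketch represents an iteration vector as a std::vector of
loop counter values, outermost first; IterVec, lex_greater and
lex_greater_equal are names chosen here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using IterVec = std::vector<int>;  // loop counters, outermost loop first

// True if I is lexicographically greater than J, i.e. I is executed after J
// in the sequential code.
bool lex_greater(const IterVec& I, const IterVec& J) {
    return std::lexicographical_compare(J.begin(), J.end(),
                                        I.begin(), I.end());
}

// True if I is lexicographically greater than or equal to J.
bool lex_greater_equal(const IterVec& I, const IterVec& J) {
    return I == J || lex_greater(I, J);
}
```

For a nest of depth 2, the iteration (2,1) comes after (1,5) because the
outermost counter decides first; the inner counter is only compared when the
outer counters are equal.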

Section 3.1 recalls the different concepts of dependence graphs introduced in
the informal discussion of Section 2.1: expanded dependence graphs (EDG),
reduced dependence graphs (RDG), apparent dependence graphs (ADG), and the
notion of distance sets. In Section 3.2, we formally define what we call
polyhedral reduced dependence graphs (PRDG), i.e. reduced dependence graphs
whose edges are labeled by polyhedra. Finally, in Section 3.3, we show how
the model of PRDG generalizes classical dependence abstractions of distance
sets such as dependence levels and direction vectors.

3.1 Dependence Graphs and Distance Sets 

Dependence relations between operations are defined by Bernstein's conditions
[4]. Briefly speaking, two operations are considered dependent if both
operations access the same memory location and if at least one of the
accesses is a write. The dependence is directed according to the sequential
order, from the first executed operation to the last one. Depending on the
order of write(s) and/or read, the dependence corresponds to a so-called flow
dependence, anti dependence or output dependence. We write
$S_i(I) \Rightarrow S_j(J)$ if statement $S_j$ at iteration J depends on
statement $S_i$ at iteration I. The partial order defined by $\Rightarrow$
describes the expanded dependence graph (EDG). Note that $(J - I)$ is always
lexicographically nonnegative when $S_i(I) \Rightarrow S_j(J)$.
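
As a small illustration of the classification just described, the C++ sketch
below decides the kind of dependence between two memory accesses given in
sequential order. The types Access and DepKind and the function classify are
hypothetical, and a memory location is simplified to an integer address.

```cpp
#include <cassert>

enum class DepKind { None, Flow, Anti, Output };

struct Access {
    int location;   // simplified memory address (an assumption of this sketch)
    bool is_write;  // true for a write, false for a read
};

// `first` is executed before `second` in the sequential order.
DepKind classify(const Access& first, const Access& second) {
    if (first.location != second.location) return DepKind::None;
    if (first.is_write && second.is_write) return DepKind::Output;
    if (first.is_write) return DepKind::Flow;   // write, then read
    if (second.is_write) return DepKind::Anti;  // read, then write
    return DepKind::None;                       // two reads never conflict
}
```

A dependence analyser would apply such a test to every pair of accesses that
may touch the same array cell; the difficulty discussed in the text lies in
deciding when two subscripted accesses refer to the same location.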

In general, the EDG cannot be computed at compile-time, either because some
information is missing (such as the values of size parameters or even worse,
precise memory accesses), or because generating the whole graph is too
expensive (see [35, 37] for a survey on dependence tests such as the gcd
test, the power test, the omega test, the lambda test, and [18] for more
details on exact dependence analysis). Instead, dependences are captured
through a smaller cyclic directed graph, with s vertices (as many as
statements), called the reduced dependence graph (RDG) (or statement level
dependence graph).

The RDG is a compression of the EDG. In the RDG, two statements $S_i$ and
$S_j$ are said to be dependent (we write $e : S_i \to S_j$) if there exists
at least one pair (I, J) such that $S_i(I) \Rightarrow S_j(J)$. Furthermore,
the edge ² e from $S_i$ to $S_j$ in the RDG is labeled by the set
$\{(I, J) \in \mathcal{D}^2 \mid S_i(I) \Rightarrow S_j(J)\}$, or by an
approximation $D_e$ that contains this set. The precision and the
representation of this approximation determine the power of the dependence
analysis.

In other words, the RDG describes, in a condensed manner, an iteration level
dependence graph, called the (maximal) apparent dependence graph (ADG), that
is a superset of the EDG. The ADG and the EDG have the same vertices, but the
ADG has more edges, defined by:

$$(S_i, I) \Rightarrow (S_j, J) \text{ (in the ADG)} \iff \exists\, e = (S_i, S_j) \text{ (in the RDG) such that } (I, J) \in D_e.$$

For a certain class of nested loops, it is possible to express exactly this
set of pairs (I, J) (see [18]): I is given as an affine function $f_{i,j}$
(in some particular cases, involving floor or ceiling functions) of J, where
J varies in a polyhedron $\mathcal{P}_{i,j}$:

$$\{(I, J) \in \mathcal{D}^2 \mid S_i(I) \Rightarrow S_j(J)\} = \{(f_{i,j}(J), J) \mid J \in \mathcal{P}_{i,j} \subseteq \mathcal{D}\} \qquad (3.1)$$

In most dependence analysis algorithms, however, rather than the set of pairs
(I, J), one computes the set $E_{i,j}$ of all possible values $(J - I)$.
$E_{i,j}$ is called the set of distance vectors, or distance set:

$$E_{i,j} = \{(J - I) \mid S_i(I) \Rightarrow S_j(J)\}$$
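
For small loop bounds, a distance set can be enumerated by brute force. The
C++ sketch below does this for the two single-statement loop nests of
Section 2.1. It relies on assumptions that hold for those examples only: the
domain is the square 1..n by 1..n, and operation (i,j) writes cell a(i,j), so
a cell that is read identifies the operation that wrote it. The names
distance_set, reads_first_example and reads_second_example are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (i, j), also used for distances

// Cells read by operation (i,j) in the two example loop nests.
std::vector<Point> reads_first_example(int i, int j, int /*n*/) {
    return {{i - 1, j - 1}, {i, j - 1}};
}
std::vector<Point> reads_second_example(int i, int j, int n) {
    return {{i - 1, n}, {i, j - 1}};
}

// Enumerates J - I over all flow dependences from operation I to operation J.
template <class Reads>
std::set<Point> distance_set(int n, Reads reads) {
    std::set<Point> distances;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= n; ++j) {
            for (const Point& src : reads(i, j, n)) {
                bool in_domain = 1 <= src.first && src.first <= n &&
                                 1 <= src.second && src.second <= n;
                // The writer must precede the reader (std::pair compares
                // lexicographically).
                if (in_domain && src < Point(i, j)) {
                    distances.insert({i - src.first, j - src.second});
                }
            }
        }
    }
    return distances;
}
```

For the first loop nest the result is the two uniform distances (0,1) and
(1,1) whatever the value of n. For the second one the set grows with n, since
it contains (0,1) and every (1, j - n) for j between 1 and n, which is why an
approximation is needed at compile-time.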

When exact dependence analysis is feasible, Equation 3.1 shows that the set
of distance vectors is the projection of the integer points of a polyhedron.
This set can be approximated by its convex hull or by a more or less accurate
description of a larger polyhedron (or a finite union of polyhedra). When the
set of distance vectors is represented by a finite union, the corresponding
dependence edge in the RDG is decomposed into multi-edges.

² Actually, there is such an edge for each pair of memory accesses that induces a
dependence between $S_i$ and $S_j$.

Note that the representation by distance vectors is not equivalent to the
representation by pairs (as in Equation 3.1), since the information
concerning the location in the EDG of such a distance vector is lost. This
may even cause some loss of parallelism, as will be seen in Example 6.4.
However, this representation remains important, especially when exact
dependence analysis is either too expensive or not feasible.

Classical representations of distance sets (by increasing precision) are:

- level of dependence, introduced in [1, 2] for Allen and Kennedy's
  parallelizing algorithm.

- direction vector, introduced by Lamport [26] and by Wolfe in [32, 33], then
  used in Wolf and Lam's parallelizing algorithm [31].

- dependence polyhedron, introduced in [22] and used in Irigoin and Triolet's
  supernode partitioning algorithm [23]. We refer to the PIPS software [21]
  for more details on dependence polyhedra.

We now formally define reduced dependence graphs whose edges are labeled by
dependence polyhedra. Then we show that this representation subsumes the two
other representations, namely dependence levels and direction vectors.

3.2 Polyhedral Reduced Dependence Graphs 

We first recall the mathematical definition of a polyhedron, and how it can
be decomposed into vertices, rays and lines.

Definition 3.1 (Polyhedron, polytope).

A set P of vectors in $\mathbb{Q}^n$ is called a (convex) polyhedron if there
exists an integral matrix A and an integral vector b such that:

$$P = \{x \in \mathbb{Q}^n \mid Ax \le b\}$$

A polytope is a bounded polyhedron.

A polyhedron can always be decomposed as the sum of a polytope and of a
polyhedral cone (for more details see [30]). A polytope is defined by its
vertices, and any point of the polytope is a non-negative barycentric
combination of the polytope vertices. A polyhedral cone is finitely generated
and can be defined by its rays and lines. Any point of a polyhedral cone is
the sum of a nonnegative combination of its rays and of any combination of
its lines.

Therefore, a dependence polyhedron P can be equivalently defined by a set
of vertices (denoted by $\{v_1, \ldots, v_\omega\}$), a set of rays
(denoted by $\{r_1, \ldots, r_\rho\}$), and a set of lines (denoted by
$\{l_1, \ldots, l_\lambda\}$). Then, P is the set of all vectors p such
that:

$$p = \sum_{i=1}^{\omega} \mu_i v_i + \sum_{i=1}^{\rho} \nu_i r_i + \sum_{i=1}^{\lambda} \xi_i l_i \qquad (3.2)$$

with $\mu_i \in \mathbb{Q}^+$, $\nu_i \in \mathbb{Q}^+$,
$\xi_i \in \mathbb{Q}$, and $\sum_{i=1}^{\omega} \mu_i = 1$.

We now define what we call a polyhedral reduced dependence graph (or
PRDG), i.e. a reduced dependence graph labeled by dependence polyhedra.

Actually, we are interested only in integral vectors that belong to the
dependence polyhedra, since dependence distances are indeed integral
vectors.

Definition 32. A polyhedral reduced dependence graph (PRDG) is
a RDG, where each edge $e : S_i \to S_j$ is labeled by a dependence
polyhedron P(e) that approximates the set of distance vectors: the
associated ADG contains an edge from instance I of node $S_i$ to instance
J of node $S_j$ if and only if $(J - I) \in P(e)$.

We explore this representation of dependences in Section 6. At first
sight, the reader can see dependence polyhedra as a generalization of
direction vectors.

3.3 Definition and Simulation of Classical Dependence 
Representations 

We come back to more classical dependence abstractions: level of dependence 
and direction vector. We recall their definition and show that RDGs labeled 
by direction vectors or dependence levels are actually particular cases of 
polyhedral reduced dependence graphs. 

Direction vectors. When the set of distance vectors is a singleton, the
dependence is said to be uniform and the unique distance vector is called
a uniform dependence vector.

Otherwise, the set of distance vectors can still be represented by an
n-dimensional vector (called the direction vector), whose components
belong to $\mathbb{Z} \cup \{*\} \cup (\mathbb{Z} \times \{+, -\})$. Its
i-th component is an approximation of the i-th components of all possible
distance vectors: it is equal to z+ (resp. z-) if all i-th components are
greater (resp. smaller) than or equal to z. It is equal to * if the i-th
component may take any value and to z if the dependence is uniform in this
dimension with unique value z. In general, + (resp. -) is used as
shorthand for 1+ (resp. (-1)-).

We denote by $e_i$ the i-th canonical vector, i.e. the n-dimensional
vector whose components are all null except the i-th component, which is
equal to 1. Then, a direction vector is nothing but an approximation by a
polyhedron, with a single vertex and whose rays and lines, if any, are
canonical vectors.

Indeed, consider an edge e labeled by a direction vector d and denote by
$I^+$, $I^-$ and $I^*$ the sets of components of d which are respectively
equal to z+ (for some integer z), z-, and *. Finally, denote by $d_z$ the
n-dimensional vector whose i-th component is equal to z if the i-th
component of d is equal to z, z+ or z-, and to 0 otherwise.

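Then, by definition of the symbols +, - and *, the direction vector d
represents exactly all n-dimensional vectors p for which there exist
integers $(\nu, \nu', \xi)$ in
$\mathbb{N}^{|I^+|} \times \mathbb{N}^{|I^-|} \times \mathbb{Z}^{|I^*|}$
such that:

$$p = d_z + \sum_{i \in I^+} \nu_i e_i - \sum_{i \in I^-} \nu'_i e_i + \sum_{i \in I^*} \xi_i e_i \qquad (3.3)$$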

In other words, the direction vector d represents all integer points that
belong to the polyhedron defined by the single vertex $d_z$, the rays
$e_i$ for $i \in I^+$, the rays $-e_i$ for $i \in I^-$ and the lines $e_i$
for $i \in I^*$.

For example, the direction vector (2+, *, -, 3) defines the polyhedron
with one vertex (2, 0, -1, 3), two rays (1, 0, 0, 0) and (0, 0, -1, 0),
and one line (0, 1, 0, 0).
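
The conversion from a direction vector to its vertex, rays and lines can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the encoding of components (integers for z, strings such as "2+", "-" or "*" for the symbolic cases) are assumptions made here, not notation from the text.

```python
def direction_vector_to_polyhedron(d):
    """Return (vertex, rays, lines) for a direction vector d.

    Hypothetical encoding: each component is an int z, a string "z+"
    or "z-", or "*". The shorthands "+" and "-" stand for "1+" and
    "-1-", as in the text.
    """
    n = len(d)

    def canonical(i, sign=1):
        # i-th canonical vector, possibly negated
        return tuple(sign if j == i else 0 for j in range(n))

    vertex, rays, lines = [], [], []
    for i, c in enumerate(d):
        if isinstance(c, int):          # uniform in this dimension
            vertex.append(c)
        elif c == "*":                  # any value: one line
            vertex.append(0)
            lines.append(canonical(i))
        elif c.endswith("+"):           # >= z: ray e_i
            vertex.append(1 if c == "+" else int(c[:-1]))
            rays.append(canonical(i))
        elif c.endswith("-"):           # <= z: ray -e_i
            vertex.append(-1 if c == "-" else int(c[:-1]))
            rays.append(canonical(i, -1))
        else:
            raise ValueError(f"unknown component: {c!r}")
    return tuple(vertex), rays, lines


example = direction_vector_to_polyhedron(("2+", "*", "-", 3))
```

On the direction vector (2+, *, -, 3), the sketch returns the vertex, the two rays and the line listed in the example above.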

Dependence levels. The representation by level is the least accurate
dependence abstraction. In a loop nest with n nested loops, the set of
distance vectors is approximated by an integer l, in
$[1, n] \cup \{\infty\}$, defined as the largest integer such that the
l - 1 first components of the distance vectors are zero.

A dependence at level $l \le n$ means that the dependence occurs at depth
l of the loop nest, i.e. at a given iteration of the l - 1 outermost
loops. In this case, one says that the dependence is a loop carried
dependence at level l. If $l = \infty$, the dependence occurs inside the
loop body, between two different statements, and is called a loop
independent dependence. A reduced dependence graph whose edges are
labeled by dependence levels is called a Reduced Leveled Dependence Graph
(RLDG).

Consider an edge e of level l. By definition of the level, the first
non-zero component of the distance vectors is the l-th component and it
can possibly take any positive integer value. Furthermore, we have no
information on the remaining components. Therefore, an edge of level
$l < \infty$ is equivalent to the direction vector:

$$(\underbrace{0, \ldots, 0}_{l-1},\ 1+,\ \underbrace{*, \ldots, *}_{n-l})$$

and an edge of level $\infty$ corresponds to the null dependence vector.
As any direction vector admits an equivalent polyhedron, so does a
representation by level. For example, a level 2 dependence in a
3-dimensional loop nest means a direction vector (0, 1+, *), which
corresponds to the polyhedron with one vertex (0, 1, 0), one ray
(0, 1, 0) and one line (0, 0, 1).
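
As a small illustration of the two definitions above, the following Python sketch computes the level of a single distance vector and the direction vector equivalent to a level. The function names and the string encoding of "1+" and "*" are assumptions made for this sketch only.

```python
import math


def level_of_distance(d):
    """Level of one distance vector: index (from 1) of its first
    non-zero component, or infinity for the null vector."""
    for index, component in enumerate(d, start=1):
        if component != 0:
            return index
    return math.inf


def level_to_direction_vector(level, n):
    """Direction vector equivalent to a dependence level in a nest of
    n loops: l-1 zeros, then "1+", then n-l stars."""
    if level == math.inf:
        return (0,) * n                      # null dependence vector
    if not 1 <= level <= n:
        raise ValueError("level must be in [1, n] or infinite")
    return (0,) * (level - 1) + ("1+",) + ("*",) * (n - level)


level_2_in_3d = level_to_direction_vector(2, 3)
```

For a level 2 dependence in a 3-dimensional nest, this gives the direction vector (0, 1+, *) used in the example above.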



4. Allen and Kennedy’s Algorithm 

Allen and Kennedy's algorithm [2] was first designed for vectorizing
loops. Then, it was extended so as to maximize the number of parallel
loops and to minimize the number of synchronizations in the transformed
code. The input of this algorithm is a RLDG.

Allen and Kennedy's algorithm is based on the following facts:

1. A loop is parallel if it has no loop carried dependence, i.e. if there
   is no dependence, whose level is equal to the depth of the loop, that
   concerns a statement surrounded by the loop.

2. All iterations of a statement S1 can be carried out before any
   iteration of a statement S2 if there is no dependence in the RLDG from
   S2 to S1.

Property (1) allows a loop to be marked as a DOALL or a DOSEQ loop,
whereas property (2) suggests that parallelism detection can be conducted
independently in each strongly connected component of the RLDG.
Parallelism extraction is done by loop distribution.

4.1 Algorithm 

For a dependence graph G, we denote by G(k) the subgraph of G in which
all dependences at level strictly smaller than k have been removed. Here
is a sketch of the algorithm in its most basic formulation. The initial
call is Allen-Kennedy(RLDG, 1).

ALLEN-KENNEDY(G, k).

- If k > n, stop.

- Decompose G(k) into its strongly connected components G_i and sort them
  topologically.

- Rewrite code so that each G_i belongs to a different loop nest (at
  level k) and the order on the G_i is preserved (distribution of loops
  at level k).

- For each G_i, mark the loop at level k as a DOALL loop if G_i has no
  edge at level k. Otherwise mark the loop as a DOSEQ loop.

- For each G_i, call Allen-Kennedy(G_i, k + 1).
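
The recursive procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the graph encoding (a list of `(source, target, level)` triples), the function names, and the use of Tarjan's algorithm for the strongly connected components are choices made here. The sketch only records the DOALL/DOSEQ marks per statement and does not rewrite code. The sample edge list is read off by hand from the array subscripts of the three-loop example that follows (statements S1 and S2), and is therefore an assumption as well.

```python
from collections import defaultdict


def sccs_in_topological_order(nodes, edges):
    """Tarjan's algorithm. Components are returned sources first."""
    succ = defaultdict(list)
    for u, v, _ in edges:
        succ[u].append(v)
    index, low, on_stack, stack, found = {}, {}, set(), [], []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in succ[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:           # u is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            found.append(sorted(comp))

    for u in nodes:
        if u not in index:
            visit(u)
    return found[::-1]                   # Tarjan emits sinks first


def allen_kennedy(nodes, edges, k, n, marks=None):
    """Record, for each statement, the mark of each surrounding loop."""
    marks = {} if marks is None else marks
    if k > n:
        return marks
    gk = [(u, v, l) for (u, v, l) in edges if l >= k]   # G(k)
    for comp in sccs_in_topological_order(nodes, gk):
        inside = [(u, v, l) for (u, v, l) in gk if u in comp and v in comp]
        kind = "DOSEQ" if any(l == k for _, _, l in inside) else "DOALL"
        for s in comp:
            marks.setdefault(s, []).append(kind)
        allen_kennedy(comp, inside, k + 1, n, marks)
    return marks


# Levels derived by hand from the subscripts of the example (assumption).
rldg = [("S1", "S1", 1), ("S1", "S1", 3), ("S2", "S1", 2),
        ("S2", "S2", 2), ("S1", "S2", 1)]
marks = allen_kennedy(["S1", "S2"], rldg, 1, 3)
```

With this input, S1 ends up with the marks DOSEQ, DOALL, DOSEQ and S2 with DOSEQ, DOSEQ, DOALL, from the outermost to the innermost loop.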

We illustrate Allen and Kennedy's algorithm on the code below:

Example 41.

  DO i=1,n
    DO j=1,n
      DO k=1,n
        S1: a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + b(i, j-1, k)
        S2: b(i, j, k) = b(i, j-1, k+j) + a(i-1, j, k)
      ENDDO
    ENDDO
  ENDDO

The dependence graph G = G(1), drawn on Figure 4.1, has only one
strongly connected component and at least one edge at level 1, thus the
first call finds that the outermost loop is sequential. However, at level
2 (the edges at level 1 are no longer considered), G(2) has two strongly
connected components: all iterations of statement S2 can be carried out
before any iteration of statement S1. A loop distribution is performed.
The strongly connected component including S1 contains no edge at level 2
but one edge at level 3. Thus the second loop surrounding S1 is marked
DOALL and the third one DOSEQ. The strongly connected component
including S2 contains an edge at level 2 but no edge at level 3. Thus the
second loop surrounding S2 is marked DOSEQ and the third one DOALL.
Finally, we get the code given after the figure caption.

Fig. 4.1. RLDG for Example 41.

  DOSEQ i=1,n
    DOSEQ j=1,n
      DOALL k=1,n
        S2: b(i, j, k) = b(i, j-1, k+j) + a(i-1, j, k)
      ENDDO
    ENDDO
    DOALL j=1,n
      DOSEQ k=1,n
        S1: a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + b(i, j-1, k)
      ENDDO
    ENDDO
  ENDDO

4.2 Power and Limitations 

It has been shown in [6] that for each statement of the initial code, as
many surrounding loops as possible are detected as parallel loops by
Allen and Kennedy's algorithm. More precisely, consider a statement S of
the initial code and $L_i$ one of the surrounding loops. Then $L_i$ will
be marked as parallel if and only if there is no dependence at level i
between two instances of S.
This result proves only that the algorithm is optimal among all parallelization 
algorithms that describe, in the transformed code, the instances of S with 
exactly the same loops as in the initial code. In fact a much stronger result 
has been proved in [17]: 

Theorem 41. Algorithm Allen-Kennedy is optimal among all parallelism 
detection algorithms whose input is a Reduced Leveled Dependence Graph 
(RLDG). 

It is proved in [17] that for any loop nest $\mathcal{N}_1$, there exists
a loop nest $\mathcal{N}_2$, which has the same RLDG, and such that for
any statement S of $\mathcal{N}_1$ surrounded after parallelization by
$d_S$ sequential loops, there exists in the exact dependence graph of
$\mathcal{N}_2$ a dependence path which includes $\Omega(N^{d_S})$
instances of statement S. In other words, Allen and Kennedy's algorithm
cannot distinguish $\mathcal{N}_1$ from $\mathcal{N}_2$ as they have the
same RLDG, and the parallelization algorithm is optimal in the strongest
sense on $\mathcal{N}_2$ as it reaches on each statement the upper bound
on the parallelism defined by the longest dependence paths in the EDG.

This proves that, as long as the only information available is the RLDG,
it is not possible to find more parallelism than found by Allen and
Kennedy's algorithm. In other words, algorithm Allen-Kennedy is well
adapted to a representation of dependences by dependence levels.
Therefore, to detect more parallelism than found by algorithm
Allen-Kennedy, more information on the dependences is required. Classical
examples for which it is possible to outperform algorithm Allen-Kennedy
are Example 42, where a simple interchange (Figure 4.2) reveals
parallelism, and Example 43, where a simple skew and interchange
(Figure 4.3) are sufficient.

Example 42.

  DO i=1,n
    DO j=1,n
      a(i, j) = a(i-1, j-1) + a(i, j-1)
    ENDDO
  ENDDO

Fig. 4.2. Example 42: code and RDG.



Example 43.

  DO i=1,n
    DO j=1,n
      a(i, j) = a(i-1, j) + a(i, j-1)
    ENDDO
  ENDDO

Fig. 4.3. Example 43: code and RDG.
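
The effect of these two transformations can be checked on the distance vectors themselves. The Python sketch below is only an illustration: the distance vectors are read off by hand from the two examples, and the matrices chosen for "interchange" and "skew then interchange" are one plausible reading of Figures 4.2 and 4.3, which are not reproduced here. In both cases, every transformed distance vector is carried by the new outermost loop, so the new inner loop carries no dependence.

```python
def apply(T, d):
    """Matrix-vector product T.d on small integer tuples."""
    return tuple(sum(row[c] * d[c] for c in range(len(d))) for row in T)


def level(d):
    """Index (from 1) of the first non-zero component of d."""
    for index, component in enumerate(d, start=1):
        if component != 0:
            return index
    return float("inf")


def lex_positive(d):
    """True if the first non-zero component of d is positive."""
    for component in d:
        if component != 0:
            return component > 0
    return False


# Distance vectors read off the subscripts (assumption of this sketch).
example_42 = [(1, 1), (0, 1)]
example_43 = [(1, 0), (0, 1)]

interchange = ((0, 1), (1, 0))
skew_then_interchange = ((1, 1), (1, 0))   # (i, j) -> (i + j, i)

new_42 = [apply(interchange, d) for d in example_42]
new_43 = [apply(skew_then_interchange, d) for d in example_43]
```

Before the transformation, each example has one dependence at level 1 and one at level 2, so Allen-Kennedy marks both loops DOSEQ. Afterwards all dependences are at level 1 and remain lexicographically positive.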



5. Wolf and Lam’s Algorithm 

Examples 42 and 43 contain some parallelism that cannot be detected by
Allen and Kennedy's algorithm. Therefore, as shown by Theorem 41, this
parallelism cannot be extracted if the dependences are represented by
dependence levels. To overcome this limitation, Wolf and Lam [31]
proposed an algorithm that uses direction vectors as input. Their work
unifies all previous algorithms based on elementary matrix operations
such as loop skewing, loop interchange, loop reversal, into a unique
framework: the framework of valid unimodular transformations.

5.1 Purpose 

Wolf and Lam aim at building sets of fully permutable loop nests. Fully
permutable loops are the basis of all tiling techniques [5,23,29,31].
Tiling is used to expose medium-grain and coarse-grain parallelism.
Furthermore, a set of d fully permutable loops can be rewritten as a
single sequential loop and d - 1 parallel loops. Thus, this method can
also be used to express fine grain parallelism.

Wolf and Lam's algorithm builds the largest set of outermost fully
permutable loops (see the footnote below). Then it looks recursively at
the remaining dimensions and at the dependences not satisfied by these
loops. The version presented in [31] builds the set of loops via a case
analysis of simple examples, and relies on a heuristic for loop nests of
depth greater than or equal to six. In the rest of this section, we
explain their algorithm from a theoretical perspective, and we provide a
general version of this algorithm.

5.2 Theoretical Interpretation 

Unimodular transformations have two main advantages: linearity and
invertibility. Given a unimodular transformation T, linearity makes it
easy to check whether T is a valid transformation. Indeed, T is valid if
and only if $Td >_l 0$ (lexicographic positivity) for all non-zero
distance vectors d. Invertibility makes it easy to rewrite the code, as
the transformation is a simple change of basis in $\mathbb{Z}^n$.

In general, $Td >_l 0$ cannot be checked for all distance vectors, as
there are too many of them. Thus, one tries to guarantee $Td >_l 0$ for
all non-zero direction vectors, with the usual arithmetic conventions in
$\mathbb{Z} \cup \{*\} \cup (\mathbb{Z} \times \{+, -\})$. In the
following, we consider only non-zero direction vectors, which are known
to be lexicographically positive (see Section 3.1).

Denote by t(1), ..., t(n) the rows of T. Let $\Gamma$ be the closure of
the cone generated by all direction vectors. For a direction vector d:

$$Td >_l 0 \iff \exists k_d,\ 1 \le k_d \le n, \text{ such that } \forall i,\ 1 \le i < k_d,\ t(i) \cdot d = 0 \text{ and } t(k_d) \cdot d > 0.$$

This means that the dependences represented by d are carried at loop
level $k_d$. If $k_d = 1$ for all direction vectors d, then all
dependences are carried by the first loop, and all inner loops are DOALL
loops. t(1) is then called a timing vector or separating hyperplane. Such
a timing vector exists if and only if the cone $\Gamma$ generated by the
direction vectors is pointed, i.e. if and only if $\Gamma$ contains no
linear space. This is also equivalent to the fact that the cone
$\Gamma^+$ defined by
$\Gamma^+ = \{ y \mid \forall x \in \Gamma,\ y \cdot x \ge 0 \}$ is
full-dimensional (see [30] for more details on cones and related
notions). Building T from n linearly independent vectors of $\Gamma^+$
transforms the loops into n fully permutable loops.

Footnote: The i-th and (i+1)-th loops are permutable if and only if the
i-th and (i+1)-th components of any distance vector of depth $\ge i$ are
nonnegative.

The notion of timing vector is at the heart of the hyperplane method and
its variants (see [10,26]), which are particularly interesting for
exposing fine-grain parallelism, whereas the notion of fully permutable
loops is the basis of all tiling techniques. As said before, both
formulations are strongly linked by the cone $\Gamma^+$.

When the cone $\Gamma$ generated by the direction vectors is not pointed,
$\Gamma^+$ has a dimension r, $1 \le r < n$, r = n - s, where s is the
dimension of the lineality space of $\Gamma$. With r linearly independent
vectors of $\Gamma^+$, one can transform the loop nest so that the r
outermost loops are fully permutable. Then, one can recursively apply the
same technique to transform the n - r innermost loops, by considering the
direction vectors not already carried by one of the r outermost loops,
i.e. by considering the direction vectors included in the lineality space
of $\Gamma$. This is the general idea of Wolf and Lam's algorithm even if
they obviously did not express it in such terms in [31].

5.3 The General Algorithm 

Our discussion can be summarized by the algorithm Wolf-Lam given below.

Algorithm Wolf-Lam takes as input a set of direction vectors D and a
sequence of linearly independent vectors E (initialized to void) from
which the transformation matrix is built:

Wolf-Lam(D, E).

- Define $\Gamma$ as the closure of the cone generated by the direction
  vectors of D.

- Define $\Gamma^+ = \{ y \mid \forall x \in \Gamma,\ y \cdot x \ge 0 \}$
  and let r be the dimension of $\Gamma^+$.

- Complete E into a set E' of r linearly independent vectors of
  $\Gamma^+$ (by construction, $E \subset \Gamma^+$).

- Let D' be the subset of D defined by
  $d \in D' \iff \forall v \in E',\ v \cdot d = 0$ (i.e.
  $D' = D \cap E'^{\perp} = D \cap \mathrm{lin.space}(\Gamma)$).

- Call Wolf-Lam(D', E').

Actually, the above process may lead to a non unimodular matrix. Building
the desired unimodular matrix T can be done as follows:

- Let D be the set of direction vectors. Set $E = \emptyset$ and call
  Wolf-Lam(D, E).

- Build a non singular matrix $T_1$ whose first rows are the vectors of E
  (in the same order). Let $T_2 = p\,T_1^{-1}$ where p is chosen so that
  $T_2$ is an integral matrix.

- Compute the left Hermite form of $T_2$, $T_2 = QH$, where H is
  nonnegative, lower triangular and Q is unimodular.

- $Q^{-1}$ is the desired transformation matrix (since
  $p\,Q^{-1}D = H\,T_1 D$).
We illustrate this algorithm with the following example:

Example 51.

  DO i=1,n
    DO j=1,n
      DO k=1,n
        a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + a(i, j-1, k+1)
      ENDDO
    ENDDO
  ENDDO

Fig. 5.1. Example 51: code and RDG.



The set of direction vectors is D = {(1, -, 0), (0, 0, 1), (0, 1, -1)}
(see Figure 5.1). The lineality space of the cone $\Gamma(D)$ generated
by D is two-dimensional (generated by (0, 1, 0) and (0, 0, 1)). Thus, the
dual cone $\Gamma^+(D)$ is one dimensional and generated by
E1 = {(1, 0, 0)}. Then D' = {(0, 0, 1), (0, 1, -1)} and $\Gamma(D')$ is
pointed. We complete E1 by two vectors of $\Gamma^+(D')$, for example by
E2 = {(0, 1, 0), (0, 1, 1)}. In this particular example, the
transformation matrix whose rows are E1, E2 is already unimodular and
corresponds to a simple loop skewing. For exposing DOALL loops, we choose
the first vector of E2 in the relative interior of $\Gamma^+(D')$, for
example E2 = {(0, 2, 1), (0, 1, 0)}. In terms of loop transformations,
this amounts to skewing the loop k by factor 2 and then to interchanging
loops j and k:

  DOSEQ i=1,n
    DOSEQ k=3,3*n
      DOALL j=max(1, ⌈(k-n)/2⌉), min(n, ⌊(k-1)/2⌋)
        a(i, j, k-2*j) = a(i-1, j+i, k-2*j) + a(i, j, k-2*j-1) + a(i, j-1, k-2*j+1)
      ENDDO
    ENDDO
  ENDDO
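
The transformation of this example can be sanity-checked numerically. The Python sketch below is an illustration under stated assumptions: the matrix T has rows (1, 0, 0), (0, 2, 1), (0, 1, 0), as chosen in the text; the distance vectors (1, -i, 0), (0, 0, 1) and (0, 1, -1) are read off by hand from the array subscripts; and n = 4 is an arbitrary small size. It checks that T is unimodular, that the transformed loop bounds scan exactly the original iteration domain, and that every dependence is carried by one of the two outer loops, so that the innermost loop is a DOALL loop.

```python
T = ((1, 0, 0), (0, 2, 1), (0, 1, 0))


def det3(m):
    """Determinant of a 3x3 integer matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def apply(m, d):
    return tuple(sum(row[c] * d[c] for c in range(3)) for row in m)


def carrying_level(d):
    """Loop level (from 1) that carries d; requires a positive leader."""
    for index, component in enumerate(d, start=1):
        if component != 0:
            assert component > 0, "not lexicographically positive"
            return index
    return float("inf")


n = 4
original = {(i, j, k) for i in range(1, n + 1)
            for j in range(1, n + 1) for k in range(1, n + 1)}

# Scan the transformed nest and map each point back to (i, j, k).
scanned = set()
for i in range(1, n + 1):
    for kk in range(3, 3 * n + 1):
        low = max(1, -((n - kk) // 2))    # ceil((kk - n) / 2)
        high = min(n, (kk - 1) // 2)      # floor((kk - 1) / 2)
        for j in range(low, high + 1):
            scanned.add((i, j, kk - 2 * j))

distances = [(1, -i, 0) for i in range(1, n + 1)] + [(0, 0, 1), (0, 1, -1)]
levels = [carrying_level(apply(T, d)) for d in distances]
```

The lower bound uses the identity ceil(x) = -floor(-x), so that only integer arithmetic is needed.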

5.4 Power and Limitations 

Wolf and Lam showed that this methodology is optimal (Theorem B.6.
in [31]): "an algorithm that finds the maximum coarse grain parallelism,
and then recursively calls itself on the inner loops, produces the
maximum degree of parallelism possible". Strangely, they gave no
hypothesis for this theorem.

However, once again, this theorem has to be understood with respect to
the dependence analysis that is used: namely, direction vectors, but
without any information on the structure of the dependence graph. A
correct formulation is the following:

Theorem 51. Algorithm Wolf-Lam is optimal among all parallelism
detection algorithms whose input is a set of direction vectors
(implicitly, one considers that the loop nest has only one statement or
that all statements form an atomic block).

Therefore, as for algorithm Allen-Kennedy, the sub-optimality of
algorithm Wolf-Lam in the general case has to be found, not in the
algorithm methodology, but in the weakness of its input: the fact that
the structure of the RDG is not exploited may result in a loss of
parallelism. For example, contrary to algorithm Allen-Kennedy, algorithm
Wolf-Lam finds no parallelism in Example 41 (whose RDG is given by
Figure 5.2) because of the typical structure of the direction vectors
(1, -, 0), (0, 1, -), (0, 0, 1).



Fig. 5.2. Reduced Dependence Graph with direction vectors for Example 41.



6. Darte and Vivien’s Algorithm 

In this section, we introduce a third parallelization algorithm, that
takes as input polyhedral reduced dependence graphs. We first explain our
motivation (Section 6.1), then we proceed to a step-by-step presentation
of the algorithm. We work out several examples.



6.1 Another Algorithm Is Needed 

We have seen two parallelization algorithms so far. Each algorithm may
output a purely sequential code for examples where the other algorithm
does find some parallelism. This motivates the search for a new algorithm
subsuming algorithms Wolf-Lam and Allen-Kennedy. To reach this goal, one
can imagine combining these algorithms, so as to simultaneously exploit
the structure of the RDG and the structure of the direction vectors:
first, compute the cone generated by the direction vectors and transform
the loop nest so as to expose the largest outermost fully permutable loop
nest; then, consider the subgraph of the RDG formed by the direction
vectors that are not carried by the outermost loops, and compute its
strongly connected components; finally, apply a loop distribution in
order to separate these components, and recursively apply the same
technique on each component.

Such a strategy exposes more parallelism by combining unimodular
transformations and loop distribution. However, it is not optimal, as
Example 61 (Figure 6.1) illustrates. Indeed, on this example, combining
algorithms Allen-Kennedy and Wolf-Lam as proposed above finds only one
degree of parallelism, since at the second phase the RDG remains strongly
connected. This is not better than the basic algorithm Allen-Kennedy.
However, one can find two degrees of parallelism in Example 61 by
scheduling S1(i, j, k) at time-step 4i - 2k and S2(i, j, k) at time-step
4i - 2k + 3.



Example 61.

  DO i=1,n
    DO j=1,n
      DO k=1,n
        S1: a(i, j, k) = b(i-1, j+i, k) + b(i, j-1, k+2)
        S2: b(i, j, k) = a(i, j-1, k+j) + a(i, j, k-1)
      ENDDO
    ENDDO
  ENDDO

Fig. 6.1. Example 61: code and RDG.
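
The one-dimensional schedule quoted for this example (S1 at time 4i - 2k, S2 at time 4i - 2k + 3) can be checked by brute force on a small instance. The Python sketch below is an illustration only: it considers the flow dependences read off by hand from the array subscripts of the example, it ignores reads of elements that are never written inside the loop nest, and the domain size n = 5 is arbitrary.

```python
from itertools import product


def check_schedule(n, t1, t2):
    """True if every value is produced strictly before it is consumed."""
    domain = set(product(range(1, n + 1), repeat=3))
    for (i, j, k) in domain:
        # S1(i, j, k) reads b(i-1, j+i, k) and b(i, j-1, k+2), written by S2.
        for producer in [(i - 1, j + i, k), (i, j - 1, k + 2)]:
            if producer in domain and not t1(i, j, k) > t2(*producer):
                return False
        # S2(i, j, k) reads a(i, j-1, k+j) and a(i, j, k-1), written by S1.
        for producer in [(i, j - 1, k + j), (i, j, k - 1)]:
            if producer in domain and not t2(i, j, k) > t1(*producer):
                return False
    return True


def time_s1(i, j, k):
    return 4 * i - 2 * k


def time_s2(i, j, k):
    return 4 * i - 2 * k + 3


valid = check_schedule(5, time_s1, time_s2)
# Dropping the shift of 3 on S2 breaks the dependence through a(i, j, k-1).
invalid = check_schedule(5, time_s1, time_s1)
```

Since the schedule has a single dimension for a nest of three loops, the iterations sharing a time-step form two degrees of parallelism.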



Consequently, we would like to have a single parallelization algorithm
which finds some parallelism at least when Allen-Kennedy or Wolf-Lam
does. The obvious solution would be to try Allen-Kennedy, then Wolf-Lam
(and even a combination of both algorithms) and to report the best
answer. But such a naive approach is not powerful enough, because it uses
either the dependence graph structure (Allen-Kennedy) or direction
vectors (Wolf-Lam), but never benefits from both kinds of information at
the same step. For example, the proposed combination of both algorithms
would use the dependence graph structure before or after the computation
of a maximal set of fully permutable loops, but never during this
computation. We claim that information on both the graph structure and
the direction vectors must be used simultaneously. This is because the
key concept when scheduling RDGs is not the cone generated by the
direction vectors (i.e. the weights of the edges of the RDG), but turns
out to be the cone generated by the weights of the cycles of the RDG.

This is the motivation for the multi-dimensional scheduling algorithm
presented below. It can be seen as a combination of unimodular
transformations, loop distribution, and index-shift method. This
algorithm subsumes algorithms Allen-Kennedy and Wolf-Lam. Beforehand we
motivate the choice of the representation of the dependences that the
algorithm works with.

6.2 Polyhedral Dependences: A Motivating Example 

In this section we present an example which contains some parallelism
that cannot be detected if the dependences are represented by levels or
direction vectors. However, there is no need to use an exact
representation of the dependences to find some parallelism in this loop
nest. Indeed, a representation of the dependences with dependence
polyhedra enables us to parallelize this code.

Example 62.

  DO i = 1, n
    DO j = 1, n
      S: a(i, j) = a(j, i) + a(i, j-1)
    ENDDO
  ENDDO

  1 ≤ i ≤ n, 1 ≤ j < n :  S(i, j) → S(i, j+1)
  1 ≤ i < j ≤ n :         S(i, j) → S(j, i)
  1 ≤ j < i ≤ n :         S(j, i) → S(i, j)

Fig. 6.2. Example 62: source code and exact dependence relations



Consider Example 62 of Figure 6.2. Its exact dependences are listed on
the same figure, and Figure 6.3 shows the corresponding (reduced)
dependence graphs when dependence edges are labeled respectively with
levels and direction vectors. What is the output of our favorite
parallelization algorithms?

Fig. 6.3. RDG for Example 62: (a) by levels, (b) by direction vectors.



- Allen-Kennedy. Here, the levels of the three dependences are
  respectively 2, 1, and 1. There is a dependence cycle at depth 1 and at
  depth 2. Therefore, no parallelism is detected.

- Wolf-Lam. Here, the dependence vectors are respectively (0, 1), (+, -),
  and (+, -). In the second dimension, the "1" and the "-" prevent the
  detection of two fully permutable loops. Therefore, the code remains
  unchanged, and no parallelism is detected.

- Feautrier. This algorithm will be described in Section 7. It takes as
  input the exact dependences. It leads to the valid schedule
  T(i, j) = 2i + j - 3. One level of parallelism is detected.

In this particular example, the representation of the dependences by
levels or by direction vectors is not accurate enough to reveal
parallelism. This is the reason why Allen-Kennedy and Wolf-Lam are not
able to detect any parallelism. Exact dependence analysis, associated
with linear programming methods that require solving large parametric
linear programs (see the footnote below), as in Feautrier's algorithm,
reveals one degree of parallelism. The corresponding parallelized code
is:

  DO j = 3, 3n
    DOPAR i = max(1, ⌈(j-n)/2⌉), min(n, ⌊(j-1)/2⌋)
      a(i, j - 2i) = a(j - 2i, i) + a(i, j - 2i - 1)
    ENDDO
  ENDDO

However, in Example 62, an exact representation of the dependences is
not necessary to reveal some parallelism. Indeed, one can notice that
there is one uniform dependence u = (0, 1) and a set of distance vectors
$\{ (j-i, i-j) = (j-i)(1, -1) \mid 1 \le j-i \le n-1 \}$ that can be
(over)-approximated by the set
$P = \{ (1, -1) + \lambda (1, -1) \mid \lambda \ge 0 \}$. P is a
polyhedron with one vertex v = (1, -1) and one ray r = (1, -1). Now,
suppose that we are looking for a linear schedule
$T(i, j) = x_1 i + x_2 j$. Let $X = (x_1, x_2)$. For T to be a valid
schedule, we look for X such that $Xd \ge 1$ for any dependence vector d.
Thus, $X(0, 1) \ge 1$ and $Xp \ge 1$ for all $p \in P$. The latter
inequality is equivalent to:
$X(1, -1) + \lambda X(1, -1) \ge 1$ for all $\lambda \ge 0$, which is
equivalent to: $X(1, -1) \ge 1$ and $X(1, -1) \ge 0$, i.e. $Xv \ge 1$ and
$Xr \ge 0$. Therefore, one has just to solve the three following
inequalities:

$$Xu \ge 1, \qquad Xv \ge 1, \qquad Xr \ge 0$$

which leads, as for Feautrier, to X = (2, 1). Thus, for this example, an
approximation of the dependences by levels or even direction vectors is
not sufficient for the detection of parallelism. However, with an
approximation of the dependences by polyhedra, we find the same
parallelism as with exact dependence analysis, but by solving a simpler
set of inequalities.
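
The three inequalities are small enough to be solved by enumeration. The Python sketch below is purely illustrative: it searches a small box of integer vectors X (the search range and the tie-breaking rule, smallest sum of absolute values, are assumptions of the sketch, not part of the method), then checks the resulting linear schedule against the exact dependence pairs of the example for a small n.

```python
from itertools import product

u, v, r = (0, 1), (1, -1), (1, -1)   # uniform vector, vertex, ray


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# All integer X in a small box satisfying Xu >= 1, Xv >= 1, Xr >= 0.
candidates = [X for X in product(range(-3, 4), repeat=2)
              if dot(X, u) >= 1 and dot(X, v) >= 1 and dot(X, r) >= 0]
best = min(candidates, key=lambda X: (abs(X[0]) + abs(X[1]), X))


def exact_dependences(n):
    """Pairs (source, sink) of iterations, following Figure 6.2."""
    for i in range(1, n + 1):
        for j in range(1, n):
            yield (i, j), (i, j + 1)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            yield (i, j), (j, i)


def schedule_is_valid(X, n):
    return all(dot(X, sink) - dot(X, source) >= 1
               for source, sink in exact_dependences(n))


ok = schedule_is_valid(best, 6)
```

The search returns X = (2, 1), which is the linear part of the schedule 2i + j - 3 mentioned for Feautrier's algorithm.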

What is important here is the "uniformization", which enables us to go
from the inequality on the set P to uniform inequalities on v and r.
Thanks to this uniformization, the affine constraints disappear and we do
not need to use the affine form of Farkas' lemma anymore, as in
Feautrier's algorithm (see Section 7). To better understand the
"uniformization" principle, think in terms of dependence paths. The idea
is to consider an edge e, from statement S to statement T and labeled by
the distance vector $p = v + \lambda r$, as a path that uses once the
"uniform" dependence vector v and $\lambda$ times the "uniform"
dependence vector r. This simulation is summarized in Figure 6.4: we
introduce a new node S' that makes it possible to simulate this path, and
a null-weight edge to go from S' back to the initial node T. This
"uniformization" principle is the underlying idea of the loop
parallelization algorithm described in this section.

Footnote (on the size of the parametric linear programs): The number of
inequalities and variables is related to the number of constraints that
define the validity domain of each dependence relation.

Fig. 6.4. Simulation of an edge labeled by a polyhedron with one vertex
and one ray.



By uniformizing the dependences, we have in fact "uniformized" the
constraints and transformed the underlying affine scheduling problem into
a simple scheduling problem where all dependences are uniform (u, v, and
r). However, there are two fundamental differences between this framework
and the classical framework of uniform loop nests:

- The uniform dependence vectors are not necessarily lexico-positive (for
  example, a ray can be equal to (0, -1)). Therefore, the scheduling
  problem is more difficult. However, it can be solved by techniques
  similar to those used to solve the problem of computability of systems
  of uniform recurrence equations [24].

- The constraint imposed on a ray r is weaker than the classical
  constraint: the constraint is indeed $Xr \ge 0$ instead of $Xr \ge 1$.
  This freedom must be taken into account by the parallelization
  algorithm.

6.3 Illustrating Example 

We work out the following example, assuming that in the reduced dependence 
graph, edges are labeled by direction vectors. The dependence graph, depicted 
in Figure 6.5, was built by the dependence analyzer Tiny [34]. 

Example 63.

  DO i = 1, n
    DO j = 1, n
      DO k = 1, j
        a(i, j, k) = c(i, j, k-1) + 1
        b(i, j, k) = a(i-1, j+i, k) + b(i, j-1, k)
        c(i, j, k+1) = c(i, j, k) + b(i, j-1, k+i) + a(i, j-k, k+1)
      ENDDO
    ENDDO
  ENDDO

Fig. 6.5. Example 63: code and RDG.

The reader can check that neither Allen-Kennedy, nor Wolf-Lam, is able to
find the full parallelism for this code: the third statement seems to be
purely sequential. However, the parallelism detection algorithm that we
propose in the next sections is able to build the following
multi-dimensional schedule: (2i + 1, 2k) for the first statement, (2i, j)
for the second statement and (2i + 1, 2k + 3) for the third statement.
This schedule corresponds to the code with explicit parallelism given
below (but in which no effort, such as loop peeling, has been made so as
to remove "if" tests). Thus, for each statement, one level of parallelism
can be detected.

  DOSEQ i = 1, n
    DOSEQ j = 1, n
      DOPAR k = 1, j
        b(i, j, k) = a(i-1, j+i, k) + b(i, j-1, k)
      ENDDO
    ENDDO
    DOSEQ k = 1, n+1
      IF (k <= n) THEN
        DOPAR j = k, n
          a(i, j, k) = c(i, j, k-1) + 1
        ENDDO
      ENDIF
      IF (k >= 2) THEN
        DOPAR j = k-1, n
          c(i, j, k) = c(i, j, k-1) + b(i, j-1, k+i-1) + a(i, j-k+1, k)
        ENDDO
      ENDIF
    ENDDO
  ENDDO

This code has been generated, from the schedule given above, by the
procedure "codegen" of the Omega Calculator (see the footnote below)
delivered with Petit [25]. We point out that the code proposed above is a
"virtual" code in the sense that it only reveals hidden parallelism. We
do not claim that it must be implemented as such.

Footnote: The Omega Calculator is a framework to compute dependences, to
check the validity of program transformations, and to transform programs,
once the transformation is given.
6.4 Uniformization Step

We first show how PRDGs (polyhedral reduced dependence graphs) can be
captured into an equivalent (but simpler to manipulate) structure, the
structure of uniform dependence graphs, i.e. graphs whose edges are
labeled by constant dependence vectors. This uniformization scheme is
achieved by the translation algorithm given below.

To avoid possible confusion between the vertices of a dependence graph
and the vertices of a dependence polyhedron, we call the first ones nodes
instead of vertices. Furthermore, the initial PRDG that describes the
dependences in the code to be parallelized is called the original graph
and denoted by $G_o = (V, E)$. The uniform RDG, equivalent to $G_o$ and
built by the translation algorithm, is called the uniform graph or the
translated of $G_o$, and is denoted by $G_u = (W, F)$.

The translation algorithm builds $G_u$ by scanning all edges of $G_o$. It
starts from $G_u = (W, F) = (V, \emptyset)$, and, for each edge e of E,
it adds to $G_u$ new nodes and new edges depending on the polyhedron
P(e). We call virtual nodes the nodes that are created, as opposed to
actual nodes which correspond to nodes of $G_o$.

Let e be an edge of E. We denote by $x_e$ and $y_e$, respectively the
tail and head of e, i.e. the nodes that e respectively leaves and enters:
$x_e \xrightarrow{e} y_e$. This definition is generalized to paths: the
head (resp. tail) of a path is the head (resp. tail) of its last (resp.
first) edge.

We follow the notations introduced in Section 3.2: $\omega$, $\rho$ and
$\lambda$ denote the number of vertices $v_i$, of rays $r_i$, and of
lines $l_i$ of the polyhedron P(e).

Translation Algorithm.

Let $W = V$ and $F = \emptyset$.
For $e : x_e \to y_e \in E$ do
    Add to $W$ a new virtual node $n_e$,
    Add to $F$ $L$ edges of weights $v_1, v_2, \ldots, v_L$ directed from $x_e$ to $n_e$,
    Add to $F$ $M$ self-loops around $n_e$ of weights $r_1, r_2, \ldots, r_M$,
    Add to $F$ $N$ self-loops around $n_e$ of weights $l_1, l_2, \ldots, l_N$,
    Add to $F$ $N$ self-loops around $n_e$ of weights $-l_1, -l_2, \ldots, -l_N$,
    Add to $F$ a null weight edge directed from $n_e$ to $y_e$.
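
The translation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Polyhedron` container, the function name `uniformize`, and the tuple encoding of virtual nodes are hypothetical choices made here, and each dependence polyhedron is assumed to be given directly by its generators (vertices, rays, lines).

```python
from dataclasses import dataclass, field


@dataclass
class Polyhedron:
    """Dependence polyhedron described by its generators (hypothetical container)."""
    vertices: list
    rays: list = field(default_factory=list)
    lines: list = field(default_factory=list)


def uniformize(nodes, edges):
    """Translate a PRDG into a uniform graph.

    nodes: the actual nodes V.
    edges: list of (tail, head, Polyhedron) triples, one per edge e of E.
    Returns (W, F, virtual) where F holds (tail, head, weight) triples.
    """
    W = list(nodes)      # W starts as V
    F = []               # F starts empty
    virtual = set()
    for k, (x, y, poly) in enumerate(edges):
        dim = len(poly.vertices[0])
        n = ("virtual", k)                    # new virtual node for edge e
        W.append(n)
        virtual.add(n)
        for v in poly.vertices:               # one edge per vertex, tail -> virtual node
            F.append((x, n, tuple(v)))
        for r in poly.rays:                   # one self-loop per ray
            F.append((n, n, tuple(r)))
        for l in poly.lines:                  # two opposite self-loops per line
            F.append((n, n, tuple(l)))
            F.append((n, n, tuple(-c for c in l)))
        F.append((n, y, (0,) * dim))          # null weight edge, virtual node -> head
    return W, F, virtual


if __name__ == "__main__":
    # Made-up edge: one vertex, one ray, one line (not taken from the text).
    poly = Polyhedron(vertices=[(1, 1)], rays=[(0, 1)], lines=[(1, 0)])
    W, F, virtual = uniformize(["S1", "S2"], [("S1", "S2", poly)])
    for edge in F:
        print(edge)
```

Every edge of the result carries a constant weight, so the output can be handed to any algorithm that expects a uniform dependence graph, provided it keeps track of which nodes are virtual.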

Back to Example 63. The PRDG of Example 63 is drawn in Figure 6.5. 
Figure 6.6 shows the uniform dependence graph associated to it. It has three 
new nodes in gray (i.e. virtual nodes) that correspond to the symbol and 

the two symbols (S)“ in the initial direction vectors. 

6.5 Scheduling Step 

Fig. 6.6. Translated uniform reduced dependence graph.

The scheduling step takes as input the translated dependence graph $G_u$ and
builds a multi-dimensional schedule for each actual node, i.e. for each node
of $G_u$ that corresponds to a node of $G_o$. $G_u$ is assumed to be strongly
connected (otherwise the algorithm is called for each strongly connected
component of $G_u$).

This is a recursive algorithm. Each step of the recursion builds a particular
subgraph $G'$ of the current graph $G$ being processed. Once $G'$ is built, a
set of linear constraints is derived and a valid schedule that satisfies all
dependence edges not in $G'$ can be computed. Then, the algorithm keeps
working on the remaining edges, i.e. the edges of $G'$ (more precisely $G'$
and some additional edges, see below).

$G'$ is defined as the subgraph of $G$ generated by all the edges of $G$ that
belong to at least one multi-cycle of null weight. A multi-cycle is a union
of cycles, not necessarily connected, and the weight of a union of cycles is
the sum of the weights of its constitutive cycles. $G'$ is built by the
resolution of a linear program (see Section 6.6).

The scheduling step can be summarized by the recursive algorithm given
below. The initial call is Darte-Vivien($G_u$, 1). The algorithm builds, for
each actual node $S$ of $G_u$, a sequence of vectors
$X_S^1, \ldots, X_S^{d_S}$ and a sequence of constants
$\rho_S^1, \ldots, \rho_S^{d_S}$ that define a valid multi-dimensional
schedule.

Darte-Vivien($G$, $k$).

1. Build $G'$, the subgraph of $G$ generated by all edges that belong to at
   least one null weight multi-cycle of $G$.

2. Add in $G'$ all edges from $x_e$ to $y_e$ and all self-loops on $y_e$ if
   $e = (x_e, y_e)$ is an edge already in $G'$, from an actual node $x_e$ to
   a virtual node $y_e$.

3. Select a vector $X$, and a constant $\rho_S$ for each node $S$ in $G$,
   such that:

   $$e = (x_e, y_e) \in G' \text{ or } x_e \text{ is a virtual node} \;\Rightarrow\; X w(e) + \rho_{y_e} - \rho_{x_e} \ge 0$$
   $$e = (x_e, y_e) \notin G' \text{ and } x_e \text{ is an actual node} \;\Rightarrow\; X w(e) + \rho_{y_e} - \rho_{x_e} \ge 1$$

   For all actual nodes $S$ of $G$, let $\rho_S^k = \rho_S$ and $X_S^k = X$.

4. If $G'$ is empty or has only virtual nodes, return.

5. If $G'$ is strongly connected and has at least one actual node, $G$ is not
   computable (and the initial PRDG $G_o$ is not consistent), return.

6. Otherwise, decompose $G'$ into its strongly connected components $G_i$ and
   call Darte-Vivien($G_i$, $k + 1$) for each subgraph $G_i$ that has at
   least one actual node.

Remarks

Step (2) is necessary only for general PRDGs: for example, it could be
removed for RDGs labeled by direction vectors (for details see [16]). In
this case, the resolution of a single linear program can simultaneously solve
Step (1) and Step (3).

In Step (3), we do not specify, on purpose, how the vector $X$ and the
constants $\rho$ are selected, so as to allow various selection criteria. For
example, a maximal set of linearly independent vectors $X$ can be selected if
the goal is to derive fully permutable loops (see [13] for details).

Back to Example 63. Consider the uniform dependence graph of Figure 6.6.
There are two elementary cycles of weights $(1, 0, 1)$ and $(0, 1, 1)$, and
five self-loops of weights $(0, 0, 1)$, $(0, 0, -1)$, $(0, 1, 0)$ (twice) and
$(0, -1, 0)$. Therefore, all edges (except the edges that only belong to the
cycle of weight $(1, 0, 1)$) belong to a multi-cycle of null weight. The
subgraph $G'$ is drawn in Figure 6.7.




Fig. 6.7. Subgraph of null weight multi-cycles for Example 63.



The constraints coming from edges in $G'$ impose that $X = (x, y, z)$ must
be orthogonal to the weight of all cycles of $G'$. Therefore, $y = z = 0$.
Finally, considering the other constraints, we find the solution
$X = (2, 0, 0)$, $\rho_{S_1} = \rho_{S_3} = 1$ and $\rho_{S_2} = 0$. In $G'$,
there remain four strongly connected components, and two of them are not
considered since they only have virtual nodes. The two other components have
no null weight multi-cycles. The strongly connected component with the single
node $S_2$ can be scheduled with the vector $X = (0, 1, 0)$, whereas studying
the other strongly connected component leads, among other solutions, to
$X = (0, 0, 2)$, $\rho_{S_1} = 0$, and $\rho_{S_3} = 3$.

Finally, summarizing the results, we find, as claimed in Section 6.3, the
2-dimensional schedules: $(2i, j)$ for $S_2$, $(2i + 1, 2k)$ for $S_1$ and
$(2i + 1, 2k + 3)$ for $S_3$.



6.6 Schematic Explanations 

$G_u$ does not always correspond to the RDG of a loop nest since its
dependence vectors are not necessarily lexicographically nonnegative. In
fact, if one forgets that some nodes are virtual, $G_u$ is nothing but the
reduced dependence graph of a System of Uniform Recurrence Equations (SURE),
introduced by Karp, Miller and Winograd [24]. Karp, Miller and Winograd study
the problem of computability of a SURE: they show that its computability is
linked to the problem of detecting cycles of null weight in its RDG $G$,
which can be done by a recursive decomposition of the graph, based on the
detection of multi-cycles of null weight. The key structure of their
algorithm is $G'$, the subgraph of $G$ generated by the edges that belong to
a multi-cycle of null weight.

$G'$ can efficiently be built by the resolution of a simple linear program
(program 6.1 or its dual program 6.2). This resolution makes it possible to
design a parallelization algorithm, whose principle is dual to Karp, Miller
and Winograd's algorithm:



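$$\min \Bigl\{ \sum_e v_e \;\Bigm|\; q \ge 0,\ v \ge 0,\ w \ge 0,\ q + v = 1 + w,\ Bq = 0 \Bigr\} \qquad (6.1)$$

$$\max \Bigl\{ \sum_e z_e \;\Bigm|\; z \ge 0,\ 0 \le z_e \le 1,\ X w(e) + \rho_{y_e} - \rho_{x_e} \ge z_e \Bigr\} \qquad (6.2)$$

where $w(e)$ is the dependence vector associated to the edge $e$,
$B = [C\ W]^t$, $C$ is the connection matrix and $W$ the matrix of dependence
vectors.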

Without entering into the details, $X$ is an $n$-dimensional vector and there
is one variable $\rho$ per vertex of the RDG and one variable $z$ per edge of
the RDG. The edges of $G'$ (resp. $G \setminus G'$) are the edges
$e = (x_e, y_e)$ for which $z_e = 0$ (resp. $z_e = 1$) in the optimal
solution of the dual (program 6.2), and equivalently, for which $v_e = 0$
(resp. $v_e = 1$) in the primal (program 6.1). When summing inequalities
$X w(e) + \rho_{y_e} - \rho_{x_e} \ge z_e$ on a cycle $C$ of $G$, one finds
that $X w(C) = 0$ if $C$ is a cycle of $G'$ and $X w(C) \ge l(C) > 0$
otherwise ($l(C)$ is the number of edges of $C$ not in $G'$).
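
The summation argument can be written out step by step. The derivation below is a sketch that uses only the constraint of the dual program and the fact that the per-vertex constants cancel around a cycle; here $\rho$ denotes the per-vertex variable, $z_e$ the per-edge variable, and $w(C)$ the sum of the weights of the edges of $C$.

```latex
% Sum the dual constraint over the edges of a cycle C of G.
\sum_{e \in C} \bigl( X\,w(e) + \rho_{y_e} - \rho_{x_e} \bigr) \;\ge\; \sum_{e \in C} z_e .

% Around a cycle every node is entered as often as it is left,
% so the rho terms telescope to zero:
\sum_{e \in C} \bigl( \rho_{y_e} - \rho_{x_e} \bigr) = 0
\quad\Longrightarrow\quad
X\,w(C) \;\ge\; \sum_{e \in C} z_e .

% In the optimal solution z_e = 0 on G' and z_e = 1 outside G', hence
X\,w(C) \;\ge\; l(C), \qquad l(C) = \#\{\, e \in C : e \notin G' \,\}.

% If every edge of C lies in G', each edge belongs to a multi-cycle M of null weight.
% Summing over M gives 0 = X\,w(M) \ge \sum_{e \in M} z_e \ge 0, and every summand
% X w(e) + rho_{y_e} - rho_{x_e} is nonnegative, so each one is zero. Therefore
X\,w(C) = 0 \quad \text{for every cycle } C \text{ of } G'.
```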

To see the link with algorithm Wolf-Lam, when considering the cone $\Gamma$
generated by the weights of the cycles (and not the weights of the edges),
$G'$ is the subgraph whose cycle weights generate the lineality space of
$\Gamma$ and $X$ is a vector of the relative interior of $\Gamma^+$. However,
there is no need to build $\Gamma$ effectively in order to build $G'$. This
is one of the interests of the linear programs 6.1 and 6.2.

We have outlined the main ideas of algorithm Darte-Vivien [15]. Some
technical modifications are needed to distinguish between virtual and actual
nodes, and to take into account the nature of the edges (vertices, rays or
lines of a dependence polyhedron): see [16] for full details.

6.7 Power and Limitations 

Now that we have a multi-dimensional schedule $T$, we can prove its
optimality in terms of degree of parallelism. We can show [14, 16] that for
each statement $S$ (i.e. for each node of $G_o$), the number of instances of
$S$ that have been sequentialized by $T$ is of the same order as the number
of instances of $S$ that are inherently sequentialized by the dependences.

Theorem 61. The scheduling algorithm is nearly optimal: if the iteration
domain contains (resp. is contained in) a full dimensional cube of size
$\Omega(N)$ (resp. $O(N)$), and if $d$ is the depth (the number of nested
recursive calls) of the algorithm, then the latency of the schedule is
$O(N^d)$ and the length of the longest dependence path is $\Omega(N^d)$. More
precisely, after code generation, each statement $S$ is surrounded by exactly
$d_S$ sequential loops and these loops are considered inherently sequential
because of the dependence analysis.

Once again, this algorithm is optimal with respect to the dependence 
analysis. Consider the example in Figure 6.8. 



Example 64.

DO i=1,n
  DO j=1,n
    S1: a(i,j) = b(i-1,j+i) + a(i,j-1)
    S2: b(i,j) = a(i-1,j-i) + b(i,j-1)
  ENDDO
ENDDO

Fig. 6.8. Example 64: code and RDG.



If the dependences are described by distance vectors, the RDG has two
self-dependences $(0, 1)$ and two edges labeled by polyhedra, both with one
vertex and one ray (respectively $(0, 1)$ and $(0, -1)$). Therefore, there
exists a multi-cycle of null weight. Furthermore, the two actual vertices
belong to $G'$. Thus, the depth of algorithm Darte-Vivien is 2 and no
parallelism can be found.

However, computing iteration $(i, j)$ of the first statement (resp. the
second statement) at step $2i + j$ (resp. $i + j$) leads to a valid schedule
that exposes one degree of parallelism. Darte-Vivien was not able to find
parallelism in this example because the approximation of the dependences had
already lost all the parallelism.
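
The validity of this schedule can be checked by brute force on a small instance. The Python sketch below is an illustration only: it assumes that both loops of the example run from 1 to n, it enumerates flow dependences directly from the array subscripts, and the function names are hypothetical.

```python
from collections import Counter


def flow_dependences(n):
    """Yield (source, sink) pairs; an operation is (statement, i, j)."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # S1: a(i,j) = b(i-1,j+i) + a(i,j-1)
            if i - 1 >= 1 and j + i <= n:
                yield ("S2", i - 1, j + i), ("S1", i, j)
            if j - 1 >= 1:
                yield ("S1", i, j - 1), ("S1", i, j)
            # S2: b(i,j) = a(i-1,j-i) + b(i,j-1)
            if i - 1 >= 1 and j - i >= 1:
                yield ("S1", i - 1, j - i), ("S2", i, j)
            if j - 1 >= 1:
                yield ("S2", i, j - 1), ("S2", i, j)


def step(op):
    """Logical date of an operation: 2i + j for S1, i + j for S2."""
    stmt, i, j = op
    return 2 * i + j if stmt == "S1" else i + j


def schedule_is_valid(n):
    """Every source must be scheduled strictly before its sink."""
    return all(step(src) < step(dst) for src, dst in flow_dependences(n))


def widest_step(n):
    """Largest number of operations sharing one date (a rough parallelism measure)."""
    ops = [(s, i, j) for s in ("S1", "S2")
           for i in range(1, n + 1) for j in range(1, n + 1)]
    return max(Counter(step(op) for op in ops).values())


if __name__ == "__main__":
    print(schedule_is_valid(8), widest_step(8))
```

A date shared by several operations means these operations can run in the same parallel step, which is what "one degree of parallelism" amounts to on a finite instance.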

The technique we used here to detect parallel loops consists in looking for
multi-dimensional schedules whose linear parts (the vectors $X$) may be
different for different statements even if they belong to the same strongly
connected component. This is the basis of Feautrier's algorithm [20] whose
fundamental mathematical tool is the affine form of Farkas' lemma. Theorem 61
however shows that there is no need to look for different linear parts (whose
construction is more expensive and leads to more complicated rewriting
processes) in a given strongly connected component of the current subgraph
$G'$, as long as dependences are given by distance vectors. On the other
hand, Example 64 shows that such a refinement may be useful only when a more
accurate dependence analysis is available.



7. Feautrier’s Algorithm 

In [20], Paul Feautrier proposed an algorithm to schedule static control
programs with affine dependences. This algorithm makes use of an exact
dependence analysis, which is always feasible for such programs [18]. This is
to be contrasted with the previous three algorithms (Allen-Kennedy, Wolf-Lam,
and Darte-Vivien) which work with an approximation of the dependences.

Feautrier's algorithm takes as input a reduced dependence graph $G$ in
which an edge $e : S_i \to S_j$ is labeled by the set of pairs $(I, J)$ such
that $S_j(J)$ depends on $S_i(I)$. This algorithm builds recursively a
multi-dimensional affine schedule for each statement of the loop nest:

Feautrier($G$).

- Decompose $G$ into its strongly connected components $G_i$ and sort them
  topologically.

- For each strongly connected component $G_i$:

  - Find an affine schedule by statement which induces a non-negative delay
    on all dependences and satisfies as many dependences as possible.

  - Build the set $G_i'$ of unsatisfied edges. If $G_i' \ne \emptyset$, call
    Feautrier($G_i'$).

This algorithm is similar to Darte-Vivien because of its structure and
output, and because both use linear programs to build affine schedules. Here
are the main points for a comparison of the two algorithms:



® The schedules [|i + i + and + j\ minimize the latency but the code is 
more complicated to write.

- Darte-Vivien is able to schedule ^ programs even if dependence analysis
  is not feasible, given a RDG with polyhedral dependences. Feautrier is
  only able to process static control programs with affine dependences. In
  this sense, the first algorithm is more powerful. Note however that there
  are some attempts to generalize Feautrier's approach by weakening the
  constraints on its input, using a Fuzzy Array Dataflow Analysis [7].
  When dependence analysis is feasible, Feautrier is much more powerful.
  This algorithm is able to process any set of loops that describe polyhedra,
  even if the loops are not perfectly nested. Darte-Vivien can also process
  non perfectly nested loops, either by considering each block of perfectly
  nested loops separately, or by artificially fusing the non perfectly nested
  loops. In theory however, this is less natural and less powerful.

- Darte-Vivien is based on the resolution of linear programs that are
  similar to those solved by Feautrier. The only (though fundamental)
  difference is that the former looks for less general affine
  transformations. Therefore, on static control programs with affine
  dependences, Feautrier always finds more parallelism than Darte-Vivien
  (cf. Example 64). However, despite this difference, the optimality result
  for Darte-Vivien gives some hints concerning the optimality cases of
  Feautrier, which was first presented as a "greedy heuristic".

- Feautrier needs to use the affine form of Farkas' lemma to obtain its
  linear programs, which Darte-Vivien avoids thanks to its uniformization
  scheme. Therefore, Feautrier's linear programs are more complex.

- Both algorithms were extended from fine grain to medium grain parallelism
  detection through a search for fully permutable loops. Darte et al. [13]
  proposed an extension of Darte-Vivien which is a mere generalization
  of Wolf-Lam. Lim and Lam [27] proposed an extension of Feautrier
  which finds maximal sets of fully permutable loops while minimizing the
  amount of synchronization required in the parallelized code.

- Darte-Vivien produces schedules as regular as possible in order to
  generate codes as simple as possible. Indeed, this algorithm rewrites the
  codes using affine schedules, but, unlike Feautrier, these affine schedules
  are chosen such that as many statements as possible have the same linear
  part: the code generation can then be viewed as a sequence of partial
  unimodular transformations and loop distributions. As a result, the output
  codes are guaranteed to be simpler than Feautrier's codes.

A small comparison study was conducted in [28]. It used only four examples.
As expected, the complexity of Darte-Vivien was (much) lower than that of
Feautrier. More surprisingly, both algorithms output the same result on each
of the four examples considered. Obviously, more real examples should be
processed to reach a conclusion. At least we can say that more complex
techniques do not always provide better results!



^ We did not write "is able to rewrite"...

Finally, here is a code (Example 71) which obviously contains some
parallelism, but which cannot be parallelized by any of the four
parallelization algorithms surveyed in this paper:



Example 71.

DO i=1,n
  a(i) = 1 + a(n-i)
ENDDO

DOPAR i=1,⌊n/2⌋
  a(i) = 1 + a(n-i)
ENDDO
DOPAR i=⌊n/2⌋+1,n
  a(i) = 1 + a(n-i)
ENDDO

Fig. 7.1. Example 71: original code and parallelized version.
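
The equivalence of the two versions can be tested on small inputs. The Python sketch below is only an illustration: it assumes the parallel version splits the iteration range at ⌊n/2⌋, and it models each DOPAR loop by a randomly permuted sequential loop, which is a simplification of true concurrent execution. The function names are hypothetical.

```python
import random


def sequential(a):
    """Original loop; cells a[0..n] are stored in a Python list of length n+1."""
    a = list(a)
    n = len(a) - 1
    for i in range(1, n + 1):
        a[i] = 1 + a[n - i]
    return a


def two_phase(a, rng):
    """Two loops split at floor(n/2); each one runs in an arbitrary order."""
    a = list(a)
    n = len(a) - 1
    for lo, hi in ((1, n // 2), (n // 2 + 1, n)):
        order = list(range(lo, hi + 1))
        rng.shuffle(order)          # any order stands in for a DOPAR loop
        for i in order:
            a[i] = 1 + a[n - i]
    return a


def agree(max_n=12, trials=20, seed=0):
    """Compare both versions on random inputs of every size up to max_n."""
    rng = random.Random(seed)
    for n in range(max_n + 1):
        for _ in range(trials):
            a = [rng.randint(-5, 5) for _ in range(n + 1)]
            if sequential(a) != two_phase(a, rng):
                return False
    return True


if __name__ == "__main__":
    print(agree())
```

Within each half, no iteration reads a cell that another iteration of the same half writes, which is why the order inside a half does not matter.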



8. Conclusion 

Our study provides a classification of loop parallelization algorithms. The
main results are the following: Allen and Kennedy's algorithm is optimal for
a representation of dependences by levels, and Wolf and Lam's algorithm is
optimal for a representation by direction vectors (but for a loop nest with
only one statement). Neither one subsumes the other, since each uses
information that cannot be exploited by the other (graph structure for the
first one, direction vectors for the second one). However, both are subsumed
by Darte and Vivien's algorithm which is optimal for any polyhedral
representation of distance vectors. Feautrier's algorithm subsumes Darte and
Vivien's algorithm when dependences can be represented as affine dependences,
but the characterization of its optimality remains open.

We believe this classification of loop parallelization algorithms to be of
practical interest. It provides guidance for a compiler-parallelizer in order
to choose the most suitable algorithm: given the dependence analysis that is
available, the simplest and cheapest parallelization algorithm that remains
optimal should be selected. Indeed, this is the algorithm that is the most
appropriate to the available representation of dependences.

Summary. While mathematical reasoning is about fixed values, programs are
written in terms of memory cells, whose contents are changeable values. To
reason about programs, the first step is always to abstract from the memory
cells to the values they contain at a given point in the execution of the
program. This step, which is known as Dataflow Analysis, may use different
techniques according to the required accuracy and the type of programs to be
analyzed.

This paper gives a review of the ad hoc techniques which have been designed
for the analysis of Array Programs. An exact solution is possible for the
tightly constrained static control programs. The method can be extended to
more general programs, but the results are then approximations of the real
dataflow. Extensions to complex statements and to the interprocedural case
are also presented.

The results of Array Dataflow Analysis may be of use for program checking, 
program optimization and parallelization.



1. Introduction 

There are many situations in which one needs to thoroughly understand the
behavior of a program. The most obvious one is at program checking time.
If we could extract a description of a program as, e.g., a set of
mathematical equations and compare it to a specification, also given in the
same medium, debugging would become a science instead of an art. Reverse
engineering is another case in point. But the most important application of
such analyses is to optimization. Each optimization has to be proved valid in
the sense that it does not modify the program's ultimate results. To achieve
this, we have to know, in a more or less precise way, what these results are
intended to be. Since the most aggressive type of optimization a program can
be subjected to is parallelization, understanding a program before attempting
to parallelize it is a very important step.

Now, since the time of Von Neumann, programs are written in terms of
"variables", which are in fact symbolic names for memory cells. Values are
never given ^, or even named, but always alluded to as "the present content
of cell x". On the other hand, in mathematics, the subject of discourse is
always a value which never changes, although it can be unknown or arbitrary.
The value in a given memory cell can be modeled as a function of time (that
function may be constant).

^ Except in the case of constants.

Obviously, "time" here is not physical time. Besides the fact that
exhibiting such a function would be nearly impossible, it would have the
added inconvenience of not being portable among different computers. We will
use a logical time, to be defined later. The only requirement is that there
must be a "time arrow": time must belong to an ordered set. Since the state
of a computer memory does not change except at each execution of an
assignment, logical time is not continuous but discrete. Each time step is an
operation of the computer, which corresponds, from the point of view of the
programmer, to the execution of an instruction. For program analysis
purposes, there is some leeway in the definition of an operation. It may be
the execution of a machine instruction, as in the case of Instruction Level
Parallelization, or the execution of an assignment statement, as in most of
this paper, or the execution of a complex statement, as in Sect. 4.

If we stipulate that the meaning of a program is given by expressing
the value of variables as a function of (logical) time, then dataflow
analysis is the process of extracting properties of these functions from the
program text. These properties may be of widely varying precision. In some
cases, one may exhibit a closed formula for the function. In other cases, one
may only know that it has positive values. In the most frequent cases, one
has to be content with relations between values taken either at the same time
(Floyd's assertions [Flo67]) or at different times. As before, these
relations may be more or less precise. We will show that, for a simple but
useful category of programs, the result of Array Dataflow Analysis is a
system of equations relating the values of variables at distinct time points.

Dataflow analysis is based on the observation that the value one may
retrieve from a memory cell is the one which was written last. In the scalar
case, this allows one to write dataflow equations, which may be solved either
by iterative methods or by direct methods. In the case of array cells, the
problem is more difficult because there is no simple method for deciding if
two references to the same array are references to the same cell or not: two
occurrences of a[i] are references to the same cell iff i has not been
modified in between. Conversely, it may happen that a[i] and a[j] refer to
the same cell if the values $i$ and $j$ are equal.

There is a general method for devising dataflow analyses [CC77]. One
starts from a semantical description of the source language, and then one
abstracts the features of interest by constructing a nonstandard semantics.
The result of executing a given program according to this semantics, if
possible, is the required property.

^ We will adhere to the following convention: identifiers will always be
written in a Teletype font. Their values at a given time will always be
denoted by the same letter in an italic font. If necessary, the time will be
indicated by various devices (accents, subscripts, arguments).

Our main interest here is another type of analysis which has been designed
in an ad hoc way for the use of automatic parallelizers. The initial concept
was that of dependences. There is a flow dependence between statements $S_1$
and $S_2$ iff a value produced by $S_1$ may be used later by $S_2$. By
restricting the allowed expressions in subscripts and loop bounds to affine
expressions, the problem reduces to the question of the feasibility in
integers of a system of affine inequalities. The problem is solved by
standard Linear Integer Programming algorithms. It was soon realized [Fea88a]
that the same technology could give much more precise results. For programs
abiding by the same restrictions as above, and for each value in the program,
one can pinpoint its source, i.e. the name of the write operation which
created it. This information is invaluable for program checking, program
understanding (a.k.a. reverse engineering), program optimization and
parallelization.

Programs whose only control structure is the do loop, whose only data
structure is the array and in which loop bounds and subscripts are affine
functions are known as static control programs. For such programs, one can
take iteration vectors (the vectors whose components are the current values
of the loop counters) as logical time. It follows that, under the above
hypotheses, array subscripts are closed functions of (logical) time. This is
the crucial property which allows us to find relations between the other
values in the program. For programs which are outside the static control
model, devising an Array Dataflow Analysis is much more difficult. A first
possibility is to extend slightly the control model by adding conditionals
and while loops. The iteration count of a while loop cannot be bounded at
compile time. The consequence is that, if these iterations can be the source
of a value, then we cannot find the last one. In that case, all we can do is
to report that the source belongs to a set of iterations. The result of our
analysis is no longer sources, but source sets, and our aim will be to find
the smallest possible source sets. The corresponding technique is known as
Fuzzy Array Dataflow Analysis and is presented in Sect. 3. It can be extended
to the case where some subscripts are no longer affine functions [BCF97].

We will next present some extensions of ADA. The first one is to statements
which may return an unbounded number of results. Typical cases are read
statements, vector statements à la Fortran 90 and forall statements à la HPF
(Sect. 4). Procedures may return an unbounded number of results as soon as
they have at least one array argument. Hence, they belong to the above
category and can be treated in the same way as, e.g., vector operations.

In the conclusion, we sketch some applications of Array Dataflow Analysis
and point to several unsolved problems. Basic mathematical tools are
presented in the appendix.

2. Exact Array Dataflow Analysis

Exact Array Dataflow Analysis is possible only in the case of static control
programs. We will first describe this program model. The results of exact
ADA are source functions, which give, for each step in the execution of the
source program and for each memory cell, the operation which has generated
the current value of the memory cell. We give an algorithm for computing
source functions and compare it to other proposals from the literature.
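
To make the notion of source function concrete, the Python sketch below computes sources by brute force: it executes a small program while recording, for every memory cell, the operation that wrote it last. This is not the symbolic algorithm of this section. The program (an accumulation loop with statements named S1 and S2), the function names, and the closed form it is compared against are all hypothetical illustrations.

```python
def trace_sources(n):
    """Run the toy program below and record the source of each read of s(i).

        do i = 1, n
          S1: s(i) = 0
          do j = 1, i-1
            S2: s(i) = s(i) + a(j)
    """
    last_writer = {}   # memory cell -> operation that wrote it last
    sources = {}       # reading operation -> source of the value of s(i) it reads
    for i in range(1, n + 1):
        last_writer[("s", i)] = ("S1", (i,))           # S1 writes s(i)
        for j in range(1, i):
            op = ("S2", (i, j))
            sources[op] = last_writer[("s", i)]        # S2 reads s(i) ...
            last_writer[("s", i)] = op                 # ... then overwrites it
    return sources


def closed_form(i, j):
    """Expected source of s(i) in operation <S2, (i, j)> for the toy program."""
    return ("S1", (i,)) if j == 1 else ("S2", (i, j - 1))


def closed_form_matches(n):
    """Check the closed form against the recorded trace."""
    return all(src == closed_form(*op[1])
               for op, src in trace_sources(n).items())


if __name__ == "__main__":
    print(closed_form_matches(6))
```

An exact analysis aims at producing a result like `closed_form` directly from the program text, valid for every value of n, instead of discovering it by execution.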



2.1 Notations 

The objects we have to handle in this paper are mainly vectors with integer
coordinates and sets of such vectors. $|a|$ is the dimension of $a$.
$a[i..j]$ is the subvector of $a$ built from components $i$ to $j$. $a[i]$ is
a shorthand for $a[i..i]$. Familiar operators and predicates like $+$ and
$\ge$ will be tacitly extended to vectors. The sign $\ll$ denotes lexical
ordering of vectors. The max operator, when acting on vectors or vector sets,
is always to be understood as the maximum according to $\ll$. Capital letters
will usually denote sets; $\mathbb{N}$ will be the set of nonnegative
integers and $\mathbb{Z}$ the set of signed integers.
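
These conventions map directly onto Python tuples, whose built-in comparison is lexicographic. The helper names below are hypothetical and only serve as an executable restatement of the notation, with 1-indexed, inclusive component ranges.

```python
def subvector(a, i, j):
    """Components i to j of a (1-indexed, inclusive)."""
    return tuple(a[i - 1:j])


def lex_less(a, b):
    """Strict lexical ordering of two integer vectors of equal dimension."""
    assert len(a) == len(b)
    return tuple(a) < tuple(b)   # tuple comparison is lexicographic


def lex_max(vectors):
    """Maximum of a non-empty collection of vectors according to lexical ordering."""
    return max(tuple(v) for v in vectors)


if __name__ == "__main__":
    print(subvector((4, 5, 6, 7), 2, 3))
    print(lex_less((1, 5), (2, 0)))
    print(lex_max([(1, 9), (2, 0), (1, 10)]))
```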

2.2 The Program Model 

Let us first insist that the present work is not about any particular
language, but about the static subset of any programming language. To
emphasize this fact, the examples will be written indifferently in Fortran,
Pascal or C. Furthermore, the fact that a given program fragment belongs to
this static subset may be self-evident from the program text, or may be the
result of elaborate preprocessing (goto elimination, induction variable
detection, constant propagation, do loop reconstruction, to cite a few). In
this paper, we will always suppose that such preprocessing has already been
applied and that we are dealing with its results.

For simplicity, data types will be restricted to integers, reals, and
n-dimensional arrays of integers and reals. Adding other scalar types
(Boolean, complex numbers) and even record types is easy. The only statements
we will consider in this section are scalar and array assignments. The only
control constructs will be the sequence and the do loop. A do loop has the
property that it possesses a counter, and that neither the counter nor its
upper and lower bounds are modified in the loop body. In this paper, we will
suppose that the loop step is always one. If the step is a known numerical
constant, the program can always be transformed to have step one. If the step
is an expression, the program will be considered to be beyond the static
control model.

The Pascal for loop has all of the above properties and thus can be
considered equivalent to a Fortran do loop. The C for loop is a more complex
object since the loop counter, lower and upper bounds are not recognized
by the language, and since these elements can be modified in the loop body.
However, it is possible to check whether these restrictions are adhered to,
and thus to identify those C loops which are equivalent to a Fortran loop.

We will also suppose that compound statements are flattened, i.e. that
constructions such as

    begin S1;
      begin S2; S3
      end
    end

are replaced by the equivalent:

    begin S1; S2; S3 end

2.2.1 Restrictions. The above restrictions are obviously intended to
simplify the calculation of the total number of iterations of all loops. This
is, however, not sufficient: we have to specify the form and content of the
loop bounds. The simplest case is when limits are known numerical values.
This, however, is much too restrictive, since many programs use variable
limits (matrix and vector dimensions, discretization size, etc.) and even non
rectangular loop nests: consider for instance the prevalence in numerical
analysis of triangularization algorithms (like those of Gauss or Cholesky).
These observations motivate the following definition of the class of static
control programs.

To recognize a static control program, one must first identify its structure
parameters: a set of integer variables which are defined only once in the
program, and whose value depends only on the outside world (through an input
statement) or on other already defined structure parameters. A program has
static control if all its loops are do loops whose bounds depend only on
structure parameters, numerical constants and outer loop iteration counters.
The analysis technique which is presented here is based on the theory of
affine inequalities, and hence is applicable only if all limits are affine
functions. For similar reasons, all subscripts are restricted to affine
functions of the loop counters and the structure parameters.

We will use the fact that in a correct program, array subscripts are always
within the array bounds. Hence, two array references address the same
memory location if and only if they are references to the same array and
their subscripts are equal. This restriction is not too severe if we note,
first, that it is good programming practice to debug a program before
submitting it to an optimizing or restructuring compiler, and also that the
methods of this paper may be used as a highly efficient array access checker.

This hypothesis will allow us to ignore array declarations. As a
consequence, our technique will be equally applicable to languages which
enforce constant array bounds (Fortran, Pascal, C, ...) and to those which do
not (as for instance Fortran 90).

2.2.2 The Sequencing Predicate. Values in array elements are produced
by execution of statements. Hence we need a notation to pinpoint a specific
execution of a statement, or operation. Our first need is an unambiguous
designation of a statement in a program. Our solution is to use arbitrary
names, which will be denoted by letters such as $R$, $S$, $T$. When
discussing examples, we will use the fact that our preferred languages allow
the affixing of a numerical label to each statement. By convention, the
statement labeled $i$ will be named $S_i$. In the balance of this paper, we
will mostly be interested in simple statements. However, some discussions
will be clearer if all statements, compound or simple, are named.

In our source language fragments, the only repetitive construct is the do
loop. Hence, an operation is uniquely defined by the name of the statement
and the values of the surrounding loop counters (the iteration vector
[Kuc78]).

A pair such as $\langle R, a \rangle$ whose components are a statement name
and an integer vector will be called an (operation) coordinate. To denote a
statement instance, a coordinate must satisfy two conditions:

- the dimension of $a$ must be equal to the number of loops surrounding $R$;
- all components of $a$ must be within the corresponding loop bounds.

With each loop $L$ we may associate a pair of inequalities:

$$\mathit{lb}_L \le a \le \mathit{ub}_L,$$

where $a$ is the loop counter of $L$. If a statement $R$ is embedded in a
loop nest $L_1, L_2, \ldots, L_N$, in that order, then the iteration vector
$a$ of $R$ must satisfy:

$$\mathit{lb}_p \le a[p] \le \mathit{ub}_p \qquad (2.1)$$

(2.1) may be summarized in matrix form as:

$$E_R\, a \ge n_R \qquad (2.2)$$

where $E_R$ is a $2N \times N$ matrix and $n_R$ is a vector of dimension $2N$
in which the structure parameters may occur linearly.

Formula (2.2) will be called the existence predicate of R. Notice that we 
do not suppose that $\mathrm{lb}_{L_p} \le \mathrm{ub}_{L_p}$. In accordance with the Pascal convention 
(and with the "modern" interpretation of Fortran do loops), a loop whose 
bounds violate this inequality will not be executed at all. 
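
As a small illustration of the existence predicate, the sketch below stores each loop's bounds as functions of the enclosing loop counters and tests whether an integer vector is a legitimate iteration vector. The names `Loop` and `exists`, and the triangular loop nest used as an example, are illustrative assumptions and do not come from the text.

```python
from typing import Callable, List, Sequence, Tuple

# A loop is described by two functions giving its lower and upper bound
# from the counters of the enclosing loops (hypothetical representation).
Loop = Tuple[Callable[[Sequence[int]], int], Callable[[Sequence[int]], int]]

def exists(nest: List[Loop], a: Sequence[int]) -> bool:
    """Return True when lb <= a[p] <= ub holds for every loop of the nest."""
    if len(a) != len(nest):          # dimension must equal the nesting depth
        return False
    for p, (lb, ub) in enumerate(nest):
        outer = a[:p]                # counters of the enclosing loops
        if not (lb(outer) <= a[p] <= ub(outer)):
            return False
    return True

# Assumed example nest: for i := 1 to n do for j := 1 to i-1 do ...
n = 4
nest = [(lambda o: 1, lambda o: n), (lambda o: 1, lambda o: o[0] - 1)]

inside = exists(nest, (3, 2))        # a point of the triangular domain
empty_inner = exists(nest, (1, 1))   # the inner loop is empty when i = 1
```

With these bounds, `(3, 2)` is accepted while `(1, 1)` is rejected, since the inner loop has an upper bound smaller than its lower bound when i = 1 and is therefore not executed at all.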

Consider for example the program sketch in Figure 2.1. Figure 2.2 describes 
its iteration domains. The existence predicate of statement $S_2$ may be 
written as: 

The preceding discussion leads to a spatial description of loops. Such a 
point of view goes back to the work of Kuck; see also Padua and Wolfe's 
review article [PW86]. Usually, loops are explained from a temporal point of 
view: iteration i is executed just before iteration i + 1, and so on. We must 
seek a way to reconcile those two aspects. This may be done by defining a 
sequencing predicate on the iteration domains. The sequencing predicate is 
a strict total order on the set of operation coordinates; it is written: 

$$ \langle R, \mathbf{a} \rangle \prec \langle S, \mathbf{b} \rangle $$

and expresses the fact that $\langle R, \mathbf{a} \rangle$ is executed before $\langle S, \mathbf{b} \rangle$. The sequencing 
predicate depends only on the source program text. We have given a simple 
expression for it in [Fea91]. Let $N_{RS}$ be the number of loops which enclose 
both statements R and S. Let $\triangleleft$ be the textual order of the source program: 
$R \triangleleft S$ iff R occurs before S in the program text. The execution order is given 
by: 



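$$ \langle R, \mathbf{a} \rangle \prec \langle S, \mathbf{b} \rangle \equiv \mathbf{a}[1..N_{RS}] \ll \mathbf{b}[1..N_{RS}] \lor (\mathbf{a}[1..N_{RS}] = \mathbf{b}[1..N_{RS}] \land R \triangleleft S), \tag{2.3} $$

where $\ll$ denotes lexicographic order. Knowledge of $N_{RS}$ (a matrix of integers) 
and $\triangleleft$ (a strict total order relation) is all that is needed to sequence all 
operations in a program. 

When lexicographic order is replaced by its definition, the sequencing 
predicate becomes a disjunction of $N_{RS} + 1$ affine predicates which will be 
written as $\prec_p$: 

$$ \langle R, \mathbf{a} \rangle \prec_p \langle S, \mathbf{b} \rangle \equiv (\mathbf{a}[1..p] = \mathbf{b}[1..p] \land \mathbf{a}[p+1] < \mathbf{b}[p+1]), \quad 0 \le p < N_{RS}. \tag{2.4} $$

The version for $p = N_{RS}$ is: 

$$ \langle R, \mathbf{a} \rangle \prec_{N_{RS}} \langle S, \mathbf{b} \rangle \equiv \mathbf{a}[1..N_{RS}] = \mathbf{b}[1..N_{RS}] \land R \triangleleft S. \tag{2.5} $$

One may notice that operations which stand in the relation $\prec_p$ to each 
other have exactly p identical coordinates in their iteration vectors. In Allen 
and Kennedy's paper [AK87], if two such operations give rise to a dependence, 
one says that this dependence is at depth p + 1, while if $p = N_{RS}$, the depth 
is said to be infinite. With a slight displacement of the origin, we will say that 
$\prec_p$ is the sequencing predicate at depth p, depths ranging from 0 to $N_{RS}$. 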
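
The depth-indexed form of the sequencing predicate can be sketched in a few lines of code. The function below is only an illustration: its name, its argument conventions (0-based Python sequences holding the shared loop counters first, a Boolean for the textual order) and the sample values are assumptions, not part of the original presentation.

```python
from typing import Optional, Sequence

def depth_before(a: Sequence[int], b: Sequence[int],
                 n_rs: int, r_textually_first: bool) -> Optional[int]:
    """Return the depth p at which <R,a> precedes <S,b>, or None if it does not.

    n_rs is the number of loops shared by R and S; r_textually_first tells
    whether R occurs before S in the program text.
    """
    for p in range(n_rs):
        if a[p] != b[p]:
            # p equal shared counters, then a strict comparison decides
            return p if a[p] < b[p] else None
    # all shared counters are equal: only the textual order can decide
    return n_rs if r_textually_first else None

d0 = depth_before((2, 5), (3, 1), 2, False)    # differs on the outer counter
d1 = depth_before((2, 5), (2, 7), 2, False)    # differs on the inner counter
d2 = depth_before((2, 5), (2, 5), 2, True)     # same iteration, R textually first
none = depth_before((2, 5), (2, 5), 2, False)  # same iteration, R not first
```

At most one depth can apply to a given pair of operations, which is why the disjunction of the depth-indexed predicates can be explored one depth at a time.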

2.2.3 Another Presentation of the Sequencing Predicate. We can 
derive another expression for the sequencing predicate by considering the 
execution tree of the program, which is obtained by (conceptually) unrolling all 
its loops. The nodes of the execution tree are either simple statements (the 
leaves) or compound statements (the interior nodes). A compound statement 
comes either from a genuine compound statement in the source program or 
from the unrolling of a loop. Let us number all edges issuing from a given 
node consecutively from left to right, starting from the lower bound of the 
loop in the case of unrolling, and from 1 in the case of a compound statement. 
The coordinates of the iteration vector of a leaf are the numbers encountered 
on the unique path from the root to the leaf. If we suppose that the program 
has been normalized, i.e. that the body of a loop is always a compound 
statement whatever the number of statements it contains, then the coordinates 
of the iteration vector alternate between positions in compound statements 
(constants) and loop counters (variables). By convention, the whole program 
is a compound statement, hence the first component of all iteration 
vectors is a constant. The point of this construction is that the sequencing 
predicate is now simply lexicographic order. 

Consider the program of Fig. 2.3. The iteration vectors of $S_1$ and $S_2$ are 
now $(1, k, 1)$ and $(2, i, 1, j, 1)$. From this we deduce, e.g., that all instances 
of $S_1$ execute before all instances of $S_2$. Similarly, by simplifying the 
lexicographic order, one can show that: 

$$ \langle S_2, i, j \rangle \prec \langle S_2, i', j' \rangle \equiv (2, i, 1, j, 1) \ll (2, i', 1, j', 1) \equiv i < i' \lor (i = i' \land j < j'). $$
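
Because Python compares tuples lexicographically, extended iteration vectors can be compared directly. The sketch below checks the two statements made above on a small instance; the helper names and the loop ranges (k from 0 to 2n, i and j from 0 to n) are assumptions chosen for illustration, since the program of Fig. 2.3 is not reproduced here.

```python
def s1(k):
    # first statement of the program, inside one loop on k
    return (1, k, 1)

def s2(i, j):
    # second statement of the program, inside loops on i then j
    return (2, i, 1, j, 1)

def precedes(u, v):
    """True when the operation with vector u executes before the one with vector v."""
    return u < v   # tuple comparison is lexicographic

n = 3
# every instance of S1 precedes every instance of S2
all_s1_first = all(precedes(s1(k), s2(i, j))
                   for k in range(2 * n + 1)
                   for i in range(n + 1) for j in range(n + 1))

# for two instances of S2 the order reduces to i < i' or (i = i' and j < j')
same_as_formula = all(
    precedes(s2(i, j), s2(i2, j2)) == (i < i2 or (i == i2 and j < j2))
    for i in range(n + 1) for j in range(n + 1)
    for i2 in range(n + 1) for j2 in range(n + 1))
```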

The notations we have defined in the preceding section will be extended 
to deal with the new iteration vectors. For instance, the existence predicate 
of a statement R will still be written: 

$$ E_R \mathbf{a} \ge \mathbf{n}_R $$

where the matrix $E_R$ and the vector $\mathbf{n}_R$ have new rows to deal with the 
constant values in the iteration vector. Similarly, we will still use $\mathbf{a} \prec_p \mathbf{b}$ for the 
depth p sequencing predicate, the meaning being that the above expression 
begins by p equalities on the variable components of $\mathbf{a}$ and $\mathbf{b}$. 

These new iteration vectors were introduced in [Fea92b] for other 
purposes. A similar proposal, with a different numbering scheme, has been made 
in [KP96]. 



2.3 Data Flow Analysis 

2.3.1 Formal Solution. Suppose that we are given a program conforming 
to the restrictions of section 2.2.1. Let T be a statement in which an array M 
is read. Statement T will be called the observation statement in what follows. 
Let $\mathbf{b}$ be the iteration vector of T; the subscripts of M are affine functions of 
$\mathbf{b}$. In vector form, the reference to M may be written $M[g(\mathbf{b})]$. 

Consider for instance the reference to v[i,k] in: 

      for i := 1 to n do 
        for j := 1 to i-1 do 
          for k := i+1 to n do 
    1:      v[j,k] := v[j,k] - v[i,k]*v[j,i]/v[i,i]; 

The iteration vector of $S_1$ is $(1, i, 1, j, 1, k, 1)$. The indexing function, g, is 
given by: 

$$ g(\mathbf{b}) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \mathbf{b}. $$



We are interested in finding the source of the value of $M[g(\mathbf{b})]$. Let 
$S_1, \ldots, S_n$ be the statements which produce a value for M, and let $\mathbf{a}_1, \ldots, \mathbf{a}_n$ 
be their iteration vectors. $S_i$ is of the form: 

$$ M[f_i(\mathbf{a}_i)] := \ldots $$

The source is a function of $\mathbf{b}$ which gives a coordinate when evaluated, 
which will be called the source function of $M[g(\mathbf{b})]$. 

For each $S_i$, there is a set of operations which write into $M[g(\mathbf{b})]$. Let $Q_i(\mathbf{b})$ 
be this set. The set of all candidate sources is: 

$$ Q(\mathbf{b}) = \bigcup_{i=1}^{n} Q_i(\mathbf{b}). $$

Let us state the conditions which apply to a generic member, $\mathbf{a}$, of $Q_i(\mathbf{b})$: 

- Existence Predicate: $\mathbf{a}$ must be a legitimate iteration vector for $S_i$: 

$$ E_{S_i} \mathbf{a} \ge \mathbf{n}_{S_i}. \tag{2.6} $$

- Subscript Equations: the subscripts of M must be the same at the read 
  and write operations: 

$$ f_i(\mathbf{a}) = g(\mathbf{b}). $$

  Note that this vector equation subsumes r scalar equations, where r is the 
  rank of M. In writing this equation, we have taken into account the fact 
  that the subscripts of M are guaranteed to be within the bounds of M. 
- Sequencing Predicate: $\mathbf{a}$ must be executed earlier than $\mathbf{b}$: 

$$ \mathbf{a} \prec \mathbf{b}. $$

- Environment: the observation statement must be executed: 

$$ E_T \mathbf{b} \ge \mathbf{n}_T. $$

From this we deduce the definition of $Q_i$: 

$$ Q_i(\mathbf{b}) = \{ \mathbf{a} \mid E_{S_i} \mathbf{a} \ge \mathbf{n}_{S_i},\ \mathbf{a} \prec \mathbf{b},\ f_i(\mathbf{a}) = g(\mathbf{b}) \}. \tag{2.7} $$

The sets $Q_i$ may still be subdivided according to the following observation. 
Under the restrictions of Sect. 2.2.1, the existence predicate and subscript 
equations generate a set of affine constraints. As we have seen earlier, the 
sequencing predicate is a disjunction of affine predicates $\prec_p$. Hence, $Q_i$ is a 
union of polyhedra, or, rather, sets of integer points contained in polyhedra: 

$$ Q_i^p(\mathbf{b}) = \{ \mathbf{a} \mid E_{S_i} \mathbf{a} \ge \mathbf{n}_{S_i},\ \mathbf{a} \prec_p \mathbf{b},\ f_i(\mathbf{a}) = g(\mathbf{b}) \}, \tag{2.8} $$

$$ Q(\mathbf{b}) = \bigcup_{i=1}^{n} \bigcup_{p=0}^{N_{S_i T}} Q_i^p(\mathbf{b}). \tag{2.9} $$

Finally, the source we are seeking is the lexicographic maximum of $Q(\mathbf{b})$: 

$$ \sigma(\mathbf{b}) = \max \bigcup_{i=1}^{n} \bigcup_{p=0}^{N_{S_i T}} Q_i^p(\mathbf{b}). \tag{2.10} $$

In this paper, we will make repeated use of the following: 

Property 21. 

$$ \max \bigcup_{i=1}^{n} E_i = \max_{i=1}^{n} (\max E_i), $$

where the $E_i$ are arbitrary subsets of a totally ordered set E, and where max 
is the maximum operator associated to the order relation of E. 

The proof is trivial if none of the sets $E_i$ is empty. If not, we have to introduce 
a special symbol, $\bot$, representing the undefined value, to stand in place of 
the maximum of an empty set. By convention, $\bot$ is less than any other value 
in any of the sets $E_i$: 

$$ \forall x \in E : \bot < x. \tag{2.11} $$
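
Property 21, together with the convention on the undefined value, can be checked on small finite sets. The sketch below is only an illustration: it represents the undefined value by Python's `None`, vectors by tuples, and the three sample sets are made up.

```python
from typing import Iterable, Optional, Tuple

Vec = Tuple[int, ...]
BOT = None  # plays the role of the undefined value, smaller than any vector

def lexmax(vectors: Iterable[Vec]) -> Optional[Vec]:
    """Lexicographic maximum of a set of integer vectors, BOT when it is empty."""
    best = BOT
    for v in vectors:
        if best is BOT or v > best:
            best = v
    return best

def max2(u: Optional[Vec], v: Optional[Vec]) -> Optional[Vec]:
    """Binary maximum extended with BOT < x for every x."""
    if u is BOT:
        return v
    if v is BOT:
        return u
    return max(u, v)

E1 = {(1, 4, 1), (1, 7, 1)}
E2 = set()                      # an empty candidate set
E3 = {(2, 0, 1, 3, 1)}

lhs = lexmax(E1 | E2 | E3)      # maximum of the union
rhs = BOT
for E in (E1, E2, E3):          # maximum of the individual maxima
    rhs = max2(rhs, lexmax(E))
```

Both sides evaluate to the same vector, and the empty set contributes nothing because its maximum is the undefined value.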

Application of the above property to (2.10) leads to the computation of 

$$ \sigma_i^p(\mathbf{b}) = \max Q_i^p(\mathbf{b}), \tag{2.12} $$

$$ \sigma(\mathbf{b}) = \max_{i=1}^{n} \max_{p=0}^{N_{S_i T}} \sigma_i^p(\mathbf{b}). \tag{2.13} $$

The quantities $\sigma_i^p(\mathbf{b})$ are known as direct dependences and were first 
defined by Brandes [Bra88]. 

To avoid multiple subscripts, we will renumber all possible candidates at 
all depths with a new index j. L will stand for the cardinal of the set of 
possible sources. (2.13) will be rewritten as: 

$$ \sigma(\mathbf{b}) = \max_{j=1}^{L} \sigma_j(\mathbf{b}). \tag{2.14} $$

Let us go back to the example in Figure 2.3. Consider the problem of 
finding the source of c[i+j] in statement $S_2$. There are two candidates, 
$S_1$ and $S_2$ itself, and as a consequence, three functions $\sigma_1^0$, $\sigma_2^0$ and $\sigma_2^1$. The 
vector $\mathbf{b}$, in this case, has dimension 5: $(2, i, 1, j, 1)$. To simplify notations, 
only its variable components, i and j, will be taken into account. 

Consider for instance the set $Q_2(i, j)$. Its elements are five dimensional 
integer vectors $(2, i', 1, j', 1)$ which satisfy the following constraints: 

- the index equations, $i' + j' = i + j$, 
- the sequencing constraint $i' < i \lor (i' = i \land j' < j)$. One sees that the 
  second term in the disjunction is incompatible with the index equation. 
  This implies that $Q_2^1$ is empty and $\sigma_2^1 = \bot$. 
- the limit constraints $0 \le i' \le n$, $0 \le j' \le n$. 

Examination of figure 2.4 shows that $Q_2^0(i, j)$ is empty if i = 0 or j = n. 
If not empty, its lexical maximum is the vector $(2, i - 1, 1, j + 1, 1)$. This 
implies that to represent $\sigma_2^0$ we need a conditional: 

$$ \sigma_2^0(i, j) = \text{if } (i \ge 1 \land j < n) \text{ then } (2, i - 1, 1, j + 1, 1) \text{ else } \bot. \tag{2.15} $$

The case of the other candidate is simpler; we always have: 

$$ \sigma_1^0(i, j) = (1, i + j, 1). $$

Computing the lexicographic maximum of these values is now a 
straightforward exercise in algebra. The result is: 

$$ \sigma(i, j) = \text{if } (i \ge 1 \land j < n) \text{ then } (2, i - 1, 1, j + 1, 1) \text{ else } (1, i + j, 1). \tag{2.16} $$

To obtain this result, we have relied a lot on figure 2.4 and geometrical 
intuition. Now this works fine on one- and two-dimensional problems, but 
is quite difficult and error prone in three dimensions, and is impossible 
beyond. Furthermore, a computer has no geometrical intuition at all. Our 
aim now will be to solve the above problem in a general, systematic fashion 
and to implement the corresponding algorithm. 
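
Before turning to the systematic method, the closed form (2.16) can be cross-checked by brute force on small instances. The sketch below assumes that the program of Figure 2.3 (not reproduced here) is the usual polynomial product kernel, with $S_1$ clearing c[k] for 0 ≤ k ≤ 2n and $S_2$ accumulating into c[i+j] for 0 ≤ i, j ≤ n; this assumption and all function names are illustrative.

```python
def source_by_simulation(n, i, j):
    """Last operation writing c[i+j] before S2 reads it at iteration (i, j)."""
    last_writer = {}
    for k in range(2 * n + 1):          # S1: c[k] := 0
        last_writer[k] = (1, k, 1)
    for i2 in range(n + 1):
        for j2 in range(n + 1):
            if (i2, j2) == (i, j):      # the read takes place before the write
                return last_writer[i + j]
            last_writer[i2 + j2] = (2, i2, 1, j2, 1)   # S2 writes c[i2+j2]

def source_by_formula(n, i, j):
    """Closed form (2.16)."""
    if i >= 1 and j < n:
        return (2, i - 1, 1, j + 1, 1)
    return (1, i + j, 1)

n = 5
agree = all(source_by_simulation(n, i, j) == source_by_formula(n, i, j)
            for i in range(n + 1) for j in range(n + 1))
```

Such a simulation only confirms the result for one value of n at a time, which is precisely why a parametric, symbolic method is needed.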

2.3.2 Evaluation Techniques. 

Direct Dependences. In this section, we will focus first on one particular direct 
dependence at a given depth p. When the original program conforms to the 
restrictions of section 2.2.1, all constraints which define the set in formula 
(2.12) are linear equalities or inequalities. In fact, since indexing functions are 
affine, the subscript equations are a linear system whose dimension is the rank 
of array M. The existence predicate is simply a set of linear inequalities. The 
sequencing predicate is given by (2.4) or (2.5). If the depth p is less than 
$N_{S_i T}$, then it is the conjunction of p equalities and one inequality. For 
$p = N_{S_i T}$, it is made of equalities only and does not exist if $S_i \triangleleft T$ is false. 

As a consequence, $Q_i^p(\mathbf{b})$ is the set of integer vectors which lie inside a 
polyhedron. Finding its lexical maximum is a Parametric Integer Program (a 
PIP) [Fea88b]. A short description of an algorithm for solving PIP problems 
is given in the appendix. The parameters are the components of $\mathbf{b}$ and the 
structure parameters. Note that the components of $\mathbf{b}$ are not arbitrary; they 
must satisfy various constraints, among which is: 

$$ E_T \mathbf{b} \ge \mathbf{n}_T, $$

to which may be added any available information on the structure parameters. 
These inequalities form the context of the parametric integer problem. 

Fig. 2.4. Computing the source function for the program of Figure 2.3. 
The problem is finding the source of c[3] at iteration (2, 1) and of c[6] at 
iteration (2, 4) (circled points). Square boxes enclose the corresponding Q sets. 

To express the solution, we need the concept of a quasi-affine form. Such 
a form is constructed from the parameters and integer constants by the 
operations of addition, multiplication by an integer, and Euclidean division by 
an integer. The solution is then expressed as a multistage conditional 
expression. The predicates are of the form $f(\mathbf{b}) \ge 0$, where f is quasi-affine. The 
leaves are vectors of quasi-affine forms or the "undefined" sign, $\bot$. Such an 
expression will be called a quasi-affine selection tree (quast for brevity). 

The result of this analysis is the direct dependence at depth p between 
the definition by $S_i$ and the use in T. The presence of a $\bot$ sign in a direct 
dependence indicates that, for some values of the loop counters, the reference 
in T is not defined by statement $S_i$. 

Formula (2.15) is a quast in the above sense (notice that integer division is 
not used here). Integer division appears when analyzing programs which 
access arrays with strides greater than one, as in: 

          s = 0. 
          do i = 1, 2*n, 2 
    1       x(i) = 1 
          end do 
          do k = 1, 2*n 
    2       s = s + x(k) 
          end do 

The direct dependence from x[2*i-1] in $S_1$ to x[k] in $S_2$ is given by the 
following quast: 

$$ \sigma_1(k) = (1, \text{if } 2((k + 1) \div 2) - (k + 1) \ge 0 \text{ then } (k + 1) \div 2 \text{ else } \bot, 1). $$

This formula expresses the fact that x[k] is not defined when k is even. 
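
The role of integer division in this quast can be seen by evaluating it and comparing the result with a replay of the first loop. The code below is a sketch under the assumption that the loop counter of $S_1$ is normalized to run from 1 to n, so that iteration number q writes x(2q - 1); the function names are hypothetical.

```python
BOT = None  # undefined source

def source_of_x(k: int):
    """Evaluate the quast: the writer of x[k], or BOT when k is even."""
    q = (k + 1) // 2                 # Euclidean division by 2
    if 2 * q - (k + 1) >= 0:         # holds exactly when k + 1 is even
        return (1, q, 1)
    return BOT

def source_by_simulation(n: int, k: int):
    """Replay the first loop and remember which iteration wrote x[k]."""
    writer = {}
    for step, i in enumerate(range(1, 2 * n + 1, 2), start=1):
        writer[i] = (1, step, 1)     # step is the normalized loop counter
    return writer.get(k, BOT)

n = 4
agree = all(source_of_x(k) == source_by_simulation(n, k)
            for k in range(1, 2 * n + 1))
```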

Combining the direct dependences. Consider now the problem of evaluating 
(2.14). This will be done in a sequential manner, by introducing: 

$$ \bar{\sigma}_n = \max_{j=1}^{n} \sigma_j, \qquad \bar{\sigma}_0 = \bot. $$

Obviously, $\sigma(\mathbf{b}) = \bar{\sigma}_L$ and we have the recurrence: 

$$ \bar{\sigma}_n = \max(\bar{\sigma}_{n-1}, \sigma_n). \tag{2.17} $$

We are thus led to the evaluation of $\max(\sigma, \tau)$ where $\sigma$ and $\tau$ are arbitrary quasts. 
We will use the term extended quast for any formula constructed from $\bot$ and 
quasi-linear vectors by the operations of selection (if ... then ... else ...) 
and taking a maximum. 

Our problem is then to remove the maximum operator from an extended 
quast. This is done with the help of the following rules (and their symmetrical 
counterparts, as the max operator is commutative). 

Rule 1. $\max(\bot, \sigma) = \sigma$. (This is simply a restatement of (2.11).) 

Rule 2. If $\sigma = \text{if } C \text{ then } \sigma_1 \text{ else } \sigma_2$, then: 

$$ \max(\sigma, \tau) = \text{if } C \text{ then } \max(\sigma_1, \tau) \text{ else } \max(\sigma_2, \tau). $$

Rule 3. If $\mathbf{u}$ and $\mathbf{v}$ are quasi linear vectors then 

$$ \max(\mathbf{u}, \mathbf{v}) = \text{if } \mathbf{u} \ll \mathbf{v} \text{ then } \mathbf{v} \text{ else } \mathbf{u}. $$

The context of a node in a quast, C, is the conjunction of all the predicates 
which are asserted to be true as one follows the path from the root of the 
quast to the distinguished node. C is constructed by "anding" p if the leaf 
is in the true part of a conditional if p then ... else ..., and by anding $\lnot p$ if it is in 
the false part. 

Rule 4. Let $\text{if } p \text{ then } \sigma \text{ else } \tau$ be a subtree of a quast, and let C be its 
context. Then if $C \land p$ is not feasible, replace the subtree by $\tau$. Similarly, if $C \land \lnot p$ 
is not feasible, replace the subtree by $\sigma$. 

Rule 5. $\text{if } C \text{ then } \sigma \text{ else } \sigma = \sigma$. 

Theorem 21. If Rules 1 to 5 are oriented from left to right and used as 
rewrite rules, then their application to any extended quast always terminates. 

Proof. Let us introduce the following metrics: 

The size of an extended quast $\sigma$, written $|\sigma|$, is the number of nodes in the tree 
representation of $\sigma$. It is given by the following recursive definition: 

1. $|\bot| = |\mathbf{u}| = 1$, where $\mathbf{u}$ is a quasi linear form. 

2. $|\text{if } p \text{ then } \sigma \text{ else } \tau| = 1 + |\sigma| + |\tau|$. 

3. $|\max(\sigma, \tau)| = 1 + |\sigma| + |\tau|$. 

The height of a max operator is simply the sum of the sizes of its arguments: 

$$ h(\max(\sigma, \tau)) = |\sigma| + |\tau|. $$

$|\sigma|_{\mathrm{if}}$ is the number of if's in an extended quast $\sigma$. 

Rules 1 to 3 have the property that the max operator on the left has greater 
height than the (eventual) max operators on the right. In the case of rule 2, for 
instance, we have: 

$$ h(\max(\text{if } C \text{ then } \sigma_1 \text{ else } \sigma_2, \tau)) = 1 + |\sigma_1| + |\sigma_2| + |\tau|, $$
$$ h(\max(\sigma_1, \tau)) = |\sigma_1| + |\tau|, $$
$$ h(\max(\sigma_2, \tau)) = |\sigma_2| + |\tau|, $$

and the last two heights are strictly smaller than the first one. 

If there are further max operators inside $\tau$, for instance, their height is left 
undisturbed by application of the rules. All other rules may only remove 
some max operators, without changing the height of those which are left 
undisturbed. 

Finally, the effect of rules 4 and 5 is to remove some if operators. From 
these results we deduce that as the reduction of an extended quast proceeds, 
the maximum height of the max operators stays bounded by the maximum 
height in the original quast, H. Let us associate to each quast $\sigma$ in the 
reduction a vector $\theta(\sigma)$ of dimension H + 1, whose H first components are the 
histogram of "max" heights in reverse order, the last component being $|\sigma|_{\mathrm{if}}$. 
The first component of $\theta(\sigma)$ is the number of max's of maximum height, H. 
From the above discussion, we see that the effect of rules 1 to 3 is to decrease 
by one some component, i, of $\theta(\sigma)$. In the case of rule 2, two components 
of index j, k > i are increased by one. Rules 4 and 5 may have the effect 
of decreasing some components of $\theta(\sigma)$ (if there are max operators in the 
discarded argument), and also of decreasing by at least 1 the last component. 
The conclusion is that for all elementary reduction steps $\sigma \to \tau$, we have 
$\theta(\tau) \ll \theta(\sigma)$ in lexicographic order. Since lexicographic order on positive 
integer vectors is well founded, the reduction process must eventually stop, 
Q.E.D. Furthermore, as long as there is a max operator in the reduct, one of 
the rules 1 to 3 can be applied. Hence, when the reduction stops, there are 
no max operators in the result. 

In contrast to this result, it can be shown by counterexample that our 
rewriting system is not confluent, i.e. that the same extended quast can be 
reduced to several distinct quasts. However, since all rules are semantical 
equalities, it follows that all such reducts are semantically equal. 

In the case of (2.16), we have to compute: 

$$ \sigma = \max(\bot, \max(\text{if } i \ge 1 \land j < n \text{ then } (2, i - 1, 1, j + 1, 1) \text{ else } \bot,\ (1, i + j, 1))). $$

We have successively: 

$$ \sigma = \max(\text{if } i \ge 1 \land j < n \text{ then } (2, i - 1, 1, j + 1, 1) \text{ else } \bot,\ (1, i + j, 1)) $$

by rule 1, then 

$$ \sigma = \text{if } i \ge 1 \land j < n \text{ then } \max((2, i - 1, 1, j + 1, 1), (1, i + j, 1)) \text{ else } \max(\bot, (1, i + j, 1)). $$

For the application of rule 3, we notice that $(2, i - 1, 1, j + 1, 1) \ll (1, i + j, 1)$ 
is always false. Use of this property is an example of rule 4. In the other 
arm of the conditional, rule 1 is applied again, giving the final result: 

$$ \sigma = \text{if } (i \ge 1 \land j < n) \text{ then } (2, i - 1, 1, j + 1, 1) \text{ else } (1, i + j, 1). $$
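
The elimination of max operators can be sketched as a small recursive rewriter. The code below is an illustrative reading of rules 1 to 3 only: rules 4 and 5 are not implemented, since they need a feasibility test, so the result may contain dead branches. The class names, the use of Python callables for predicates and leaves, and the `env` dictionary of parameter values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

BOT = None  # the undefined value

@dataclass(frozen=True)
class Leaf:                 # a vector of forms, evaluated in an environment
    vec: Callable

@dataclass(frozen=True)
class If:                   # if pred then `then` else `other`
    pred: Callable
    then: object
    other: object

@dataclass(frozen=True)
class Max:                  # max of two extended quasts
    left: object
    right: object

def combine(l, r):
    """max of two quasts that contain no Max node, using rules 1 to 3."""
    if l is BOT:                                   # rule 1
        return r
    if r is BOT:                                   # rule 1, symmetrical form
        return l
    if isinstance(l, If):                          # rule 2 on the left argument
        return If(l.pred, combine(l.then, r), combine(l.other, r))
    if isinstance(r, If):                          # rule 2, symmetrical form
        return If(r.pred, combine(l, r.then), combine(l, r.other))
    # rule 3: both arguments are vectors, compared lexicographically
    return If(lambda e, u=l, v=r: u.vec(e) < v.vec(e), r, l)

def reduce_max(q):
    """Rewrite an extended quast into an equivalent quast without Max nodes."""
    if q is BOT or isinstance(q, Leaf):
        return q
    if isinstance(q, If):
        return If(q.pred, reduce_max(q.then), reduce_max(q.other))
    return combine(reduce_max(q.left), reduce_max(q.right))

def evaluate(q, env):
    """Value of a Max-free quast for given parameter values."""
    while isinstance(q, If):
        q = q.then if q.pred(env) else q.other
    return BOT if q is BOT else q.vec(env)

# The extended quast reduced in the text, for the example leading to (2.16).
from_s2 = If(lambda e: e["i"] >= 1 and e["j"] < e["n"],
             Leaf(lambda e: (2, e["i"] - 1, 1, e["j"] + 1, 1)), BOT)
from_s1 = Leaf(lambda e: (1, e["i"] + e["j"], 1))
sigma = reduce_max(Max(BOT, Max(from_s2, from_s1)))
```

Evaluating `sigma` with, for example, `evaluate(sigma, {"i": 2, "j": 1, "n": 5})` follows the same branches as the final result above. Representing predicates as opaque callables keeps the sketch short but rules out the symbolic simplification of rule 4, which a real implementation would perform on affine constraints.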



2.4 Summary of the Algorithm 

Suppose that a compiler or another program processor has need to find the 
source of a reference to an array or scalar M in a statement S. The first step 
is to construct the candidate list, which comprises all statements R which 
modify M, at all depths $0 \le p \le N_{RS}$. If a standard dependence analysis 
is available, this list can be shortened by eliminating empty candidate sets, 
which correspond to non existent dependences. 

The ordering of the candidate set is a very important factor for the 
complexity of the method. Experience has shown that the best compromise is to 
list the candidates in order of decreasing depth. For equal depth candidates, 
it is best to follow the textual order backward, starting from the observation 
statement, up to the beginning of the program, and then to loop back to the 
end of the program. 

Similarly, if rule 4 is used too sparingly, the resulting quasts will have 
many dead branches, thus increasing the complexity of the final result. 
Conversely, if used too often, it will result in many unsuccessful attempts at 
simplification, also increasing the complexity. A good compromise is the 
following: 

- When computing a step of the recurrence (2.17) we will always suppose 
  that rule 4 has been applied exhaustively to $\bar{\sigma}_{n-1}$, the maximum of the 
  first n - 1 candidates. 
- In the evaluation of $\bar{\sigma}_n$, rule 2 should be applied by priority on the left 
  argument. As long as reductions are still possible on $\bar{\sigma}_{n-1}$, there is no need 
  to apply rule 4. All contexts that can be constructed here are feasible, 
  because they either come from $\bar{\sigma}_{n-1}$ or $\sigma_n$. The first quast has been simplified 
  in the previous step, and the second one comes from PIP, which does not 
  generate dead branches. 
- As soon as the application of rules 2 or 3 to $\sigma_n$ starts, simplification by rule 
  4 should be attempted. 

As a last remark, one can show (see Sect. 3.3.3 of [Fea91]) that the 
complete knowledge of the iteration vector is not needed when applying the max 
operator to sources of differing depths. In this way, one can predict 
beforehand whether a direct dependence can have influence on the source or not, 
and avoid computing it in the latter case. 

If these rules are followed, the results of array dataflow analysis are 
surprisingly simple. A limited statistical analysis in [Fea91] shows that the mean 
number of leaves per source is about two. The probable reason is that good 
programmers do a kind of dataflow analysis "in their head" to convince 
themselves that their program is correct. If the result is too complicated, they 
decide that the program is not well written and start again. 

2.5 Related Work 

Another approach to Array Dataflow Analysis has been proposed by Pugh 
and his associates (see e.g. [PW93, Won95]). The approach consists in 
reverting to the basic definition of the maximum of a set. u is the maximum of a 
totally ordered set Q iff: 

$$ u \in Q \land \lnot \exists v \in Q : u \prec v. $$

Let us consider the definition (2.12) of a set of candidate sources. According 
to the above definition, its maximum, $\sigma_i(\mathbf{b})$, is defined by: 

$$ \sigma_i(\mathbf{b}) \in Q_i(\mathbf{b}) \land \lnot \exists \mathbf{c} : \sigma_i(\mathbf{b}) \prec \mathbf{c} \prec \mathbf{b} \land \mathbf{c} \in Q_i(\mathbf{b}). \tag{2.18} $$

In words, $\sigma_i(\mathbf{b})$ is the direct dependence from $S_i$ to $\mathbf{b}$ iff $\sigma_i(\mathbf{b})$ is in flow 
dependence to $\mathbf{b}$ and if there is no other operation in flow dependence to $\mathbf{b}$ 
which is executed between $\sigma_i(\mathbf{b})$ and $\mathbf{b}$. 

The formula (2.18) is written in a subset of Presburger logic (that part of 
first order logic which deals with the theory of addition of positive numbers), 
which is known to be decidable. Pugh has devised an algorithm, the Omega 
test [Pug91], which is able to simplify formulas such as (2.18). The result is a 
relation between the source $\sigma_i(\mathbf{b})$ and $\mathbf{b}$. It has been checked that this relation 
is equivalent to the quast which is found by our method. 

Some authors [MAL93, HT94] have devised fast methods for handling 
particular cases of the source computation. The idea is to solve the set of 
equations in the definition of $Q_i^p(\mathbf{b})$ by any integer linear solver (e.g. by 
constructing the Hermite normal form of the equation matrix). Suppose that 
this system uniquely determines $\mathbf{a}$ as a function of $\mathbf{b}$: $\mathbf{a} = f(\mathbf{b})$. It remains 
only to substitute $f(\mathbf{b})$ for $\mathbf{a}$ in the inequalities. The result is the existence 
condition for the solution, which is $f(\mathbf{b})$ if this condition is satisfied, and 
$\bot$ if not. One must revert to the general algorithm if there are not enough 
equations to fix the value of the maximum. 
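
The fast path just described can be sketched for the simplest situation, a single write x[s*a + c] inside a loop lo ≤ a ≤ hi that precedes the read of x[k] in the program text. This one-dimensional setting, the function name and its parameters are assumptions made for illustration; the cited methods handle general systems of equations.

```python
BOT = None  # undefined source

def fast_source(s: int, c: int, lo: int, hi: int, k: int):
    """Source of a read of x[k] when the only write is x[s*a + c], lo <= a <= hi.

    The subscript equation s*a + c = k fixes a uniquely, so no maximum has to
    be computed: it is enough to check that the solution is an integer that
    lies inside the loop bounds.
    """
    if s == 0:
        raise ValueError("the equation does not determine a; "
                         "the general algorithm is needed")
    num = k - c
    if num % s != 0:          # no integer solution
        return BOT
    a = num // s
    return a if lo <= a <= hi else BOT

hit = fast_source(2, -1, 1, 4, 5)      # x[2a-1] with a = 3 writes x[5]
even = fast_source(2, -1, 1, 4, 4)     # no integer a writes x[4]
outside = fast_source(2, -1, 1, 4, 9)  # a = 5 is beyond the loop bounds
```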



3. Approximate Array Dataflow Analysis 

To go beyond the static control model, one has to handle while loops, 
arbitrary subscripts and tests, and, most important, modular programming 
(subroutines and function calls). Let us first introduce the following convention. 
Constructs occurring in the control statements of a program (do loop bounds, 
while loop and test predicates, subscripts) will be classified as tractable 
and intractable according to their complexity. Affine constructs are always 
tractable, while the definition of intractable constructs is somewhat 
subjective and may depend on the analysis tools which are available at any given 
time. Tractable predicates in tests can always be replaced by restrictions on 
the iteration domain of the surrounding loop nest. Similarly, a tractable 
predicate in a while loop indicates that the loop can be transformed into a do 
loop. We will suppose that such simplifications have been applied before the 
approximate analysis starts. 

In this section we will be interested in while loops and tests. Non lin- 
ear subscripts can be handled in the same framework, but they need rather 
complicated notations. The reader is referred to [BCF97] for this extension. 

As a matter of convenience, we will suppose here that while loops have 
an explicit loop counter, according to the PL/I convention: 

do c = 1 while p(...) 

The while loop counter may even be used in array subscripts. 

When constructing iteration vectors, test branches are to be considered 
as new nodes being numbered 1 and 2. In accordance with our conventions 
for static control programs, these nodes always are compound statements, 
whatever the number of their components. For instance, in example E3 below, 
the iteration vector of Statement $S_2$ is $\langle 1, x, 1, 2, 1 \rangle$. The first 1 is the index 
of the do loop in the whole program, and the second one is the index of the 
test in the do loop body. The 2 indicates that the subject statement is in the 
false part of the test. 

With these conventions, we can transpose to this new program model 
most of the notations we introduced for static control programs. Iteration 
vectors may include while loop counters and the definition of the sequencing 
predicate does not change. 






3.1 From ADA to FADA 

As soon as we extend our program model to include conditionals, while loops, 
and do loops with intractable bounds, the set $Q_i^p$ of (2.8) is no longer tractable 
at compile time. The reason is that condition (2.6) may contain intractable 
terms. One possibility is to ignore them. In this way, (2.6) is replaced by: 

$$ \hat{E}_{S_i} \mathbf{a} \ge \hat{\mathbf{n}}_{S_i}, \tag{3.1} $$

where $\hat{E}$ and $\hat{\mathbf{n}}$ are similar to $E$ and $\mathbf{n}$ in (2.6) with the intractable parts 
omitted. We may obtain approximate sets of candidate sources: 

$$ \hat{Q}_i^p(\mathbf{b}) = \{ \mathbf{a} \mid \hat{E}_{S_i} \mathbf{a} \ge \hat{\mathbf{n}}_{S_i},\ \mathbf{a} \prec_p \mathbf{b},\ f_i(\mathbf{a}) = g(\mathbf{b}) \}. \tag{3.2} $$

However, we can no longer say that the direct dependence is given by the 
lexicographic maximum of this set, since the result may precisely be one of 
the candidates which is excluded by the nonlinear part of the iteration 
domain of $S_i$. One solution is to take all of $\hat{Q}_i^p(\mathbf{b})$ as an approximation to 
the direct dependence. If we do that, and with the exception of very special 
cases, computing the maximum of approximate direct dependences has no 
meaning, and the best we can do is to use their union as an approximation. 
Can we do better than that? Let us consider some examples. 

          program E1 
          do x = 1 while ... 
    1       s = ... 
          end do 
    2     s = ... 
    3     ... = ... s ... 
          end 

What is the source of s in Statement $S_3$? There are two possibilities, 
Statements $S_1$ and $S_2$. In the case of $S_2$, everything is linear, and the source is 
exactly $\langle 2 \rangle$. Things are more complicated for $S_1$, since we have no idea of 
the iteration count of the while loop. We may, however, give a name to this 
count, say N, and write the set of candidates as: 

$$ Q_1(N) = \{ \langle 1, x, 1 \rangle \mid 1 \le x \le N \}. $$

We may then compute the maximum of this set, which is 

$$ \sigma_1(N) = \text{if } N > 0 \text{ then } \langle 1, N, 1 \rangle \text{ else } \bot. $$

The last step is to take the lexicographic maximum of this result and $\langle 2 \rangle$, 
which is simply $\langle 2 \rangle$. This is much more precise than the union of all possible 
sources. The trick here has been to give a name to an unknown quantity, N, 
and to solve the problem with N as a parameter. It so happens here that N 
disappears in the solution, giving an exact result. 

Consider now: 

          program E2 
          do x = 1 while ... 
    1       s(x) = ... 
          end do 
          do k = 1, n 
    2       ... = ... s(k) ... 
          end do 
          end 

With the same notations as above, where N is again the unknown iteration 
count of the while loop, the set of candidates for the source of s(k) in $S_2$ is: 

$$ Q_1(k, N) = \{ \langle 1, x, 1 \rangle \mid 1 \le x \le N,\ x = k \}. $$

The direct dependence is to be computed in the environment $1 \le k \le n$, 
which gives: $\text{if } k \le N \text{ then } \langle 1, k, 1 \rangle \text{ else } \bot$. Here, the unknown parameter 
has not disappeared. The best we can do is to say that we have a source set, 
or a fuzzy source, which is obtained by taking the union of the two arms of 
the conditional: 

$$ \sigma(k) \in \{ \langle 1, k, 1 \rangle, \bot \}. $$

Equivalently, by introducing a new notation, $\Sigma(\mathbf{b})$, for the source set at 
iteration $\mathbf{b}$, this can be written: 

$$ \Sigma(k) = \{ \langle 1, k, 1 \rangle, \bot \}. $$

The fact that in the presence of intractable constructs, the results are 
no longer sources but sets of possible sources justifies the name Fuzzy ADA 
which has been given to the method. FADA gives exact results (and reverts 
to ADA) when the source sets are singletons. 
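
The difference between the two examples can be reproduced by treating the unknown trip count as a parameter and collecting every value the source can take. The sketch below is only an illustration: the parameter name `N`, the function names, and the finite range of trip counts that is explored are assumptions.

```python
BOT = None  # undefined source

def source_e1(N):
    """Source of s in S3 of program E1 when the while loop runs N times."""
    from_s1 = (1, N, 1) if N > 0 else BOT      # lexicographic maximum for S1
    from_s2 = (2,)                             # S2 always executes
    return from_s2 if from_s1 is BOT else max(from_s1, from_s2)

def source_e2(N, k):
    """Source of s(k) in S2 of program E2 when the while loop runs N times."""
    return (1, k, 1) if k <= N else BOT

def fuzzy(source, trip_counts):
    """Source set: every value the source can take over the unknown trip count."""
    return {source(N) for N in trip_counts}

counts = range(0, 20)
exact = fuzzy(source_e1, counts)                       # the parameter disappears
fuzzy_e2 = fuzzy(lambda N: source_e2(N, 5), counts)    # the parameter remains
```

For E1 the set is a singleton, so the analysis is exact; for E2 it contains both the candidate operation and the undefined value, which is the fuzzy source of the text.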

Our last example is slightly more complicated: we assume that $n \ge 1$, 

          program E3; 
          begin 
    1       for x := 1 to n do 
    2       begin 
    3         if ... then 
    4         begin 
    5           s := ... 
              end 
              else 
    6         begin 
    7           s := ... 
              end 
            end; 
    8       ... := s ... 
          end 

What is the source of s in Statement $S_8$? We may build an approximate 
candidate set from $S_5$ and another one from $S_7$. Since both are approximate, 
we cannot do anything beside taking their union, and the result is wildly 
inaccurate. 

Another possibility is to partition the set of candidates according to the 
value x of the loop counter. Let us introduce a new Boolean function b(x) 
which represents the outcome of the test at iteration x. The x-th candidate 
may be written¹: 

$$ \varsigma(x) = \text{if } b(x) \text{ then } \langle 1, x, 1, 1, 1 \rangle \text{ else } \langle 1, x, 1, 2, 1 \rangle. $$

We then have to compute the maximum of all these candidates (this is an 
application of Property 21). It is an easy matter to prove that: 

$$ x < x' \Rightarrow \varsigma(x) \prec \varsigma(x'). $$

Hence the source is $\varsigma(n)$. Since we have no idea of the value of b(n), we are 
led again to the introduction of a fuzzy source: 

$$ \Sigma = \{ \langle 1, n, 1, 1, 1 \rangle, \langle 1, n, 1, 2, 1 \rangle \}. \tag{3.3} $$

Here again, notice the far greater precision we have been able to achieve. 
However, the technique we have used here is not easily generalized. Another way of 
obtaining the same result is the following. Let $L = \{ x \mid 1 \le x \le n \}$. Observe 
that the candidate set from $S_1$ (resp. $S_2$) can be written 
$\{ \langle 1, x, 1, 1, 1 \rangle \mid x \in D_1 \cap L \}$ (resp. $\{ \langle 1, x, 1, 2, 1 \rangle \mid x \in D_2 \cap L \}$) where 

$$ D_1 = \{ x \mid b(x) = \text{true} \} \text{ and } D_2 = \{ x \mid b(x) = \text{false} \}. $$

Obviously, 

$$ D_1 \cap D_2 = \emptyset \tag{3.4} $$

and 

$$ D_1 \cup D_2 = \mathbb{N}. \tag{3.5} $$

We have to compute 

$$ \beta = \max(\max D_1 \cap L, \max D_2 \cap L). $$

Using property 21 in reverse, (3.5) implies: 

$$ \beta = \max L. \tag{3.6} $$

By (3.4) we know that $\beta$ belongs either to $D_1$ or $D_2$, which gives again the 
result (3.3). 
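
The fuzzy source (3.3) can be confirmed by replaying program E3 for every possible sequence of test outcomes. The sketch below is an illustration under the vector conventions used above (position 1 for the true branch, position 2 for the false branch); the function name and the small value of n are assumptions.

```python
from itertools import product

def source_e3(outcomes):
    """Replay E3; outcomes[x-1] is the result of the test at iteration x."""
    last = None
    for x, taken in enumerate(outcomes, start=1):
        # position 1 is the true branch, position 2 the false branch
        last = (1, x, 1, 1 if taken else 2, 1)
    return last

n = 4
fuzzy_source = {source_e3(b) for b in product([True, False], repeat=n)}
```

Whatever the outcomes of the first n - 1 tests, only the two operations of the last iteration can be the source, which is exactly the two-element set of (3.3).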

To summarize these observations, our method will be to give new names 
(or parameters) to the result of maxima calculations in the presence of 
nonlinear terms. These parameters are not arbitrary. The sets they belong to 
(the parameter domains) are in relation to each other, as for instance 
(3.4)-(3.5). These relations can be found simply by examination of the 
syntactic structure of the program, or by more sophisticated techniques. From 
these relations between the parameter domains follow relations between the 
parameters, like (3.6), which can then be used to simplify the resulting fuzzy 
sources. In some cases, these relations may be so precise as to reduce the 
fuzzy source to a singleton, thus giving an exact result. 

¹ Observe that the ordinals in the following formula do not correspond to the 
statement labels in the source program. These labels have been introduced for 
later use (see Sect. 3.3). 

3.2 Introducing Parameters 

In the general case, any statement in the program is surrounded by tests 
and loops, some of which are tractable and some are not. Tractable tests and 
loops give the linear part of the existence predicate, definition (3.1) above. 
To the non tractable parts we may associate a set $D_i$ such that operation $\mathbf{a}$ 
exists iff: 

$$ \hat{E}_{S_i} \mathbf{a} \ge \hat{\mathbf{n}}_{S_i} \land \mathbf{a} \in D_i. \tag{3.7} $$

The observation which allows us to increase the precision of FADA is that 
in many cases $D_i$ has the following property: 

$$ \mathbf{a}[1..p_i] = \mathbf{b}[1..p_i] \Rightarrow (\mathbf{a} \in D_i \equiv \mathbf{b} \in D_i) \tag{3.8} $$

for a $p_i$ which is less than the depth of $S_i$. This is due to the fact that loop and 
test predicates cannot take into account variables which are not defined at 
the point they are evaluated, as is the case for inner loop counters. Usually, 
$p_i$ is the number of (while and do) loops surrounding the innermost non 
tractable construction around $S_i$. This depth may be less than this number, 
in case the intractable predicate does not depend on some variables, but this 
can be recognized only by a semantic analysis which is beyond the scope of 
this paper. 

A cylinder is a set $C$ of integer vectors such that there exists an integer
$p$ (the cylinder depth) with the property:

$$a \in C \wedge a[1..p] = b[1..p] \Rightarrow b \in C. \tag{3.9}$$

The depth of cylinder $C$ will be written $\operatorname{depth}(C)$.

The above discussion shows that to each $d_i$ we may associate a cylinder
$C_i$ by the definition:

$$a \in C_i \equiv \exists b \in d_i : a[1..p_i] = b[1..p_i],$$

with the property:

$$E_i a + n_i \ge 0 \wedge a \in C_i \;\equiv\; E_i a + n_i \ge 0 \wedge a \in d_i.$$

The depth of $C_i$ is bounded upward by the number of loops surrounding $S_i$;
a more precise analysis may show that it has a lower value.

With these conventions, the set of candidate sources becomes:

$$Q_i^p(b) = \{a \mid E_i a + n_i \ge 0,\ a \in C_i,\ a \prec_p b,\ f_i(a) = g(b)\}. \tag{3.10}$$
Let us introduce the following sets:

$$Q_i^p(b, \alpha) = \{a \mid E_i a + n_i \ge 0,\ a[1..p_i] = \alpha,\ a \prec_p b,\ f_i(a) = g(b)\}. \tag{3.11}$$

$Q_i^p(b, \alpha)$ is the intersection of $Q_i^p(b)$ with the hyperplane
$a[1..p_i] = \alpha$. (3.10) can be rewritten:

$$Q_i^p(b) = \bigcup_{\alpha \in C_i} Q_i^p(b, \alpha). \tag{3.12}$$

Another use of property 21 gives:

$$\varsigma_i^p(b) = \max Q_i^p(b) = \max_{\alpha \in C_i} \bigl(\max Q_i^p(b, \alpha)\bigr). \tag{3.13}$$

Now $Q_i^p(b, \alpha)$ is a polyhedron, as is evident from (3.11). Hence its
lexicographic maximum,

$$\varsigma_i^p(b, \alpha) = \max Q_i^p(b, \alpha), \tag{3.14}$$

can be computed by just another application of PIP. In fact, the presence
of the additional constraints $a[1..p_i] = \alpha$ may simplify this
calculation. We then have:

$$\varsigma_i^p(b) = \max_{\alpha \in C_i} \varsigma_i^p(b, \alpha). \tag{3.15}$$

The maximum in the above formula is reached at some point of $C_i$. This
point is a function of $i$, $p$ and $b$, written $\beta_i^p(b)$, and is known
as one of the parameters of the maximum of the program. The direct dependence
is now given by:

$$\varsigma_i^p(b) = \varsigma_i^p(b, \beta_i^p(b)). \tag{3.16}$$
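The decomposition of a maximum over a non-linear set into a maximum of
parametric maxima, as in (3.12) to (3.16), can be checked by brute force on a
one-dimensional instance. The sketch below is only an illustration: the
function names, the finite search bound and the use of `None` for the
undefined source are assumptions of this example, whereas the method of the
text computes each inner maximum symbolically with PIP.

```python
def direct_source(n, in_cylinder):
    """max Q(b): largest x with 1 <= x <= n that also lies in the cylinder C."""
    q = [x for x in range(1, n + 1) if in_cylinder(x)]
    return max(q) if q else None  # None stands for the undefined source


def parametric_source(n, alpha):
    """max Q(b, alpha): the non-linear test 'x in C' is replaced by x = alpha."""
    q = [x for x in range(1, n + 1) if x == alpha]
    return max(q) if q else None


def source_via_parameters(n, in_cylinder, search_bound):
    """Right-hand side of (3.15): maximum over alpha in C of max Q(b, alpha).

    Returns the maximum and the value of alpha where it is reached, i.e. the
    parameter of the maximum (beta in the text).
    """
    best, beta = None, None
    for alpha in range(-search_bound, search_bound + 1):
        if not in_cylinder(alpha):
            continue
        s = parametric_source(n, alpha)
        if s is not None and (best is None or s > best):
            best, beta = s, alpha
    return best, beta
```

For instance, with `in_cylinder = lambda x: x % 3 == 0` and `n = 10`, both
computations agree on the source, and the second one also yields the
parameter of the maximum.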

At this point, we can go on as we did in the case of exact analysis:

- Compute all parametric direct dependences by (3.14).
- Combine the direct dependences by rules 1 to 5.
- In the end result, quantify over all possible values of the parameters, so
  as to get source sets.

This procedure does not give precise results, since we lose all information
about relations between parameters of the maximum. Our aim now is to
explain how to find these relations and how to use them to narrow the final
result.
3.3 Taking Properties of Parameters into Account

The sets $C_i$, $C_j$ for $i \ne j$ may be interrelated, depending on the
position of statements $S_i$, $S_j$ in the abstract syntax tree. An example
of this situation has been observed for statements $S_5$ and $S_7$ of program
E3. These relations induce relations between the corresponding parameters,
which have to be taken into account when combining direct dependences. The
relations on the $C_i$ sets may have several origins. The most obvious ones
are associated to the structure of the source program, as in the case of E3.
It may be that other relations are valid, due for instance to the equality of
two expressions. Here again, this situation can be detected only by semantic
analysis and is outside the scope of this paper.

The structural relations among the $C_i$ can be found by the following
algorithm:

- The outermost construction of the source program (by our convention,
  a compound statement) is associated to the unique zero-depth cylinder,
  which includes all integer vectors of any length, and can be written as
  $\mathbb{Z}^*$.

- If $C_0$ is associated to:

      begin S1; ... Sn end

  then $C_i = C_0$.

- If $C_0$ is associated to:

      if p then S1 else S2

  where p is intractable, then the cylinders $C_1$ and $C_2$ associated to
  $S_1$ and $S_2$ have the same depth as $C_0$ and are such that:

  $$C_1 \cap C_2 = \emptyset, \quad C_1 \cup C_2 = C_0.$$

  If p is tractable, $C_0 = C_1 = C_2$.

- If $C_0$ is associated to a for:

      for ... do S1

  or to a while:

      do ... while ... S1

  and if these loops are intractable, then the cylinder $C_1$ associated to
  $S_1$ has a depth one more than that of $C_0$ and is such that:

  $$C_1 \subseteq C_0.$$

  Otherwise, $C_1 = C_0$.
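The structural rules above can be sketched as a traversal of a toy abstract
syntax tree. Everything in this sketch is an assumption made for
illustration: the tuple encoding of the tree, the preorder numbering of the
nodes, and the textual form of the relations. It only collects the relations
between cylinders and does not track their depths.

```python
def structural_relations(ast):
    """Collect relations between the cylinders of a toy AST.

    Hypothetical node encoding:
      ("seq", [children])                 begin S1; ...; Sn end
      ("if", tractable, then, else)       if p then S1 else S2
      ("loop", tractable, body)           for / while loop
      ("assign",)                         simple statement
    Node 0 is the root; other nodes are numbered in preorder.
    """
    relations = ["C0 = Z*"]
    next_id = [1]

    def fresh():
        i = next_id[0]
        next_id[0] += 1
        return i

    def visit(node, me):
        kind = node[0]
        if kind == "seq":
            for child in node[1]:
                c = fresh()
                relations.append(f"C{c} = C{me}")
                visit(child, c)
        elif kind == "if":
            _, tractable, then_branch, else_branch = node
            c1 = fresh()
            visit(then_branch, c1)
            c2 = fresh()
            visit(else_branch, c2)
            if tractable:
                relations.append(f"C{me} = C{c1} = C{c2}")
            else:
                relations.append(f"C{c1} & C{c2} = {{}}")   # disjoint
                relations.append(f"C{c1} | C{c2} = C{me}")  # cover the parent
        elif kind == "loop":
            _, tractable, body = node
            c = fresh()
            relations.append(f"C{c} = C{me}" if tractable else f"C{c} <= C{me}")
            visit(body, c)
        # "assign" leaves contribute no further relation

    visit(ast, 0)
    return relations


# Program E3: a tractable loop around an intractable test, then the observation.
E3 = ("seq", [
    ("loop", True, ("seq", [
        ("if", False, ("seq", [("assign",)]), ("seq", [("assign",)])),
    ])),
    ("assign",),
])
```

Under this numbering, the two assignments inside the test receive numbers 5
and 7, and `structural_relations(E3)` lists the disjointness and covering
relations between the cylinders numbered 4 and 6 that enclose them.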
The relation between ADA and FADA. In the case where all enclosing statements
of an assignment are tractable, it is easy to prove that
$C_i = \mathbb{Z}^*$. The condition $a \in C_i$ is trivially satisfied in
(3.10). Hence, in that case, FADA defaults to ADA. Provided this case is
detected soon enough, one and the same algorithm can be used for all
programs, the precision of the results depending only on the presence or
absence of intractable control constructs.
Characterization of the Parameters of the Maximum. The main observation
is that each parameter is itself a maximum. Note first that from (3.11)
follows:

$$\beta_i^p(b) = \varsigma_i^p(b, \beta_i^p(b))[1..p_i].$$

Suppose now that $Q$ is an arbitrary set of vectors all of which have
dimension at least equal to $p$. Let us set:

$$Q|_p = \{x[1..p] \mid x \in Q\}.$$

The properties of the lexicographic order ensure that:

$$(\max Q)[1..p] = \max Q|_p.$$

In our case, this gives:

$$\begin{aligned}
\beta_i^p(b) &= \varsigma_i^p(b, \beta_i^p(b))[1..p_i] \\
             &= \bigl(\max Q_i^p(b, \beta_i^p(b))\bigr)[1..p_i] \\
             &= \max Q_i^p(b, \beta_i^p(b))|_{p_i} \\
             &= \max\bigl(C_i \cap \widehat{Q}_i^p(b)|_{p_i}\bigr)
\end{aligned} \tag{3.17}$$

where $\widehat{Q}_i^p(b)$ is the polyhedral envelope of all possible sources
at depth $p$ (see (3.2)). This formula fully characterizes the parameters of
the maximum and will be used repeatedly to obtain relations between them.
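The property of the lexicographic order used in this derivation (taking a
prefix commutes with taking the maximum) is easy to check on a small example.
The sketch below relies on the fact that Python compares tuples
lexicographically; the helper name `restrict` and the sample set are
assumptions of this illustration.

```python
def restrict(vectors, p):
    """Set of the length-p prefixes of a set of integer vectors."""
    return {x[:p] for x in vectors}


# A sample set of vectors, all of dimension 3.
Q = {(1, 5, 2), (3, 0, 7), (3, 1, 0), (2, 9, 9)}

# (max Q)[1..p] compared with max of the restricted set, for every p <= 3.
prefix_property_holds = all(
    max(Q)[:p] == max(restrict(Q, p)) for p in range(1, 4)
)
```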

Another set of relations is given by depth considerations. Note that from
(3.11) follows:

$$\varsigma_i^p(b, \beta_i^p(b))[1..p_i] = \beta_i^p(b),$$

and

$$\varsigma_i^p(b, \beta_i^p(b))[1..p] = b[1..p],$$

provided the set $Q_i^p(b, \beta_i^p(b))$ is not empty. Now, in (3.17), we
can exclude the case where $\widehat{Q}_i^p(b)$ is empty, since this can be
decided a priori by integer linear programming. If such is the case,
statement $S_i$ is simply excluded from the list of candidates at depth $p$.
Hence, either $C_i$ is empty, in which case, by (3.17), we set
$\beta_i^p(b) = \bot$, or else the above relations apply. Let us set
$m_i = \min(p, p_i)$. We obtain:

$$\beta_i^p(b)[1..m_i] = b[1..m_i] \;\vee\; \beta_i^p(b) = \bot. \tag{3.18}$$
Exact Cases of FADA. Among the $C_i$ there is the set corresponding to the
observation statement, $C_{\mathrm{obs}}$. Since our convention is that the
observation statement is executed, we have $b \in C_{\mathrm{obs}}$, hence
$C_{\mathrm{obs}}$ is not empty. It may happen that the results of structural
analysis imply that $C_i = C_{\mathrm{obs}}$. Suppose that
$p \ge p_i = p_{\mathrm{obs}}$. From (3.18) we deduce:

$$\beta_i^p(b) = b[1..p_i].$$

This allows us to remove the nonlinear condition $a \in C_i$ from (3.10)
before computing its maximum.

In the case where the innermost intractable statement is a while or a do
loop, we can go a step further since $C_i$ now has the property:

$$b \in C_i \wedge (a[1..p_i - 1] = b[1..p_i - 1] \wedge a[p_i] < b[p_i]) \Rightarrow a \in C_i.$$

This means that the exactness condition is in that case:

$$p \ge p_i - 1.$$

This enables us to solve exactly such problems as the source of s in:

      do c = 1 while ...
1        s = s + ...
      end do

Here the candidate and observation statements are both $S_1$, $p_1 = 1$ and
$p = 0$. The exactness condition is satisfied, and the source is:

$$\varsigma_1^0(c) = \max\{\langle S_1, c'\rangle \mid 1 \le c',\ c' < c\}
  = \text{if } c > 1 \text{ then } \langle S_1, c - 1\rangle \text{ else } \bot.$$
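The exact result for the while loop can be confirmed by a small simulation:
whatever the (unknown) number of iterations, the read of s at iteration c
sees the value written at iteration c - 1. In this sketch the loop is run for
an explicit number of iterations, which stands in for the intractable
termination test; the function names and the use of `None` for the undefined
source are assumptions of the example.

```python
def observed_sources(n_iterations):
    """Run the loop and record, for each iteration c, which iteration wrote
    the value of s that is read at c (None if s was never written before)."""
    sources = {}
    last_write = None
    for c in range(1, n_iterations + 1):
        sources[c] = last_write  # 's = s + ...' reads s first ...
        last_write = c           # ... and then writes it
    return sources


def predicted_source(c):
    """Source given by the analysis: iteration c - 1 if c > 1, else undefined."""
    return c - 1 if c > 1 else None
```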

From Parameter Domains to Parameters of the Maximum. It remains to
study the case where the structural analysis algorithm has given nontrivial
relations between parameter domains. The associated relations between
parameters can be deduced from (3.17) by Prop. 1 and the following trivial
properties:

Property 31. If $C \cap D = \emptyset$ then:

$$(\max C = \bot \wedge \max D = \bot) \;\vee\; \max C \ne \max D.$$

Property 32. If $C \subseteq D$ then:

$$\max C \preceq \max D.$$

As a consequence, since $C \cap D \subseteq C$, we have

$$\max(C \cap D) \preceq \max C, \tag{3.19}$$

and the symmetrical relation.
Example E3 revisited. The observation statement is $S_8$. It is enclosed in
no loops. Hence, the b vector is of zero length, and will be omitted in the
sequel. There are two candidate sources, $S_5$ and $S_7$, whose iteration
vectors are of length one and will be denoted as x. In that case,
lexicographic order defaults to the standard order among integers.

The parametric sources are:

$$\varsigma_5(\alpha) = \max\{x \mid 1 \le x \le n,\ x = \alpha\}
  = \text{if } 1 \le \alpha \le n \text{ then } \langle S_5, \alpha\rangle \text{ else } \bot,$$

$$\varsigma_7(\alpha) = \max\{x \mid 1 \le x \le n,\ x = \alpha\}
  = \text{if } 1 \le \alpha \le n \text{ then } \langle S_7, \alpha\rangle \text{ else } \bot.$$

The structural analysis algorithm gives the following relations:

$$\begin{aligned}
C_1 &= C_0, & C_8 &= C_0, \\
C_2 &= C_1, & C_3 &= C_2, \\
C_4 \cup C_6 &= C_3, & C_4 \cap C_6 &= \emptyset, \\
C_5 &= C_4, & C_7 &= C_6.
\end{aligned}$$

Here, only $C_5$ and $C_7$ are interesting. Remembering that
$C_0 = \mathbb{Z}^*$, all other sets can be eliminated, giving:

$$C_5 \cup C_7 = \mathbb{Z}^*, \quad C_5 \cap C_7 = \emptyset.$$

The depths $p_5$ and $p_7$ are both equal to 1. From eq. (3.17) we deduce:

$$\beta_5 = \max(C_5 \cap \widehat{Q}_5) = \max(C_5 \cap [1, n]).$$

Similarly,

$$\beta_7 = \max(C_7 \cap [1, n]).$$

From the above relations, we deduce:

$$(C_5 \cap [1, n]) \cap (C_7 \cap [1, n]) = \emptyset$$

and

$$(C_5 \cap [1, n]) \cup (C_7 \cap [1, n]) = \mathbb{Z}^* \cap [1, n] = [1, n].$$

This equality can be interpreted as two inclusions from left to right, giving
by Prop. 32:

$$\beta_5 \le n, \quad \beta_7 \le n,$$

or as an inclusion from right to left, giving:

$$n \le \max(\beta_5, \beta_7).$$

Lastly, we deduce from the first relation, by Prop. 31, that:

$$(\beta_5 = \bot \wedge \beta_7 = \bot) \;\vee\; \beta_5 \ne \beta_7.$$
Suppose now that the maximum of $\beta_5$ and $\beta_7$ is $\beta_5$. It is
easily seen that this implies:

$$\beta_5 = n, \quad \beta_7 < n.$$

In the reverse situation, the conclusion is:

$$\beta_7 = n, \quad \beta_5 < n.$$

Hence, the final source is given by:

$$\varsigma = \begin{cases}
  [\max(\varsigma_5(\beta_5), \varsigma_7(\beta_7))]_{\beta_5 = n,\ \beta_7 < n}
     & \text{if } \beta_5 = n \wedge \beta_7 < n, \\
  [\max(\varsigma_5(\beta_5), \varsigma_7(\beta_7))]_{\beta_7 = n,\ \beta_5 < n}
     & \text{otherwise,}
\end{cases}$$

where the notation $[q]_C$ indicates that the quast $q$ is to be evaluated by
rules 1 to 5 in the context $C$. We leave it to the reader to verify that the
result is:

$$\varsigma = \text{if } \beta_5 = n \wedge \beta_7 < n
  \text{ then } \langle S_5, n\rangle \text{ else } \langle S_7, n\rangle.$$
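The result for E3 can be cross-checked by enumerating every possible outcome
of the intractable test. The sketch below is a brute-force illustration and
not the symbolic method of the text; it assumes n >= 1, labels the two
assignments "S5" and "S7", and uses `None` for the undefined value.

```python
from itertools import product


def e3_sources(n):
    """Check the relations between the parameters of the maximum for E3 and
    return the set of sources observed over all outcomes of the test."""
    observed = set()
    for outcome in product([True, False], repeat=n):  # outcome[x-1] = b(x)
        c5 = [x for x in range(1, n + 1) if outcome[x - 1]]      # C5 in [1, n]
        c7 = [x for x in range(1, n + 1) if not outcome[x - 1]]  # C7 in [1, n]
        beta5 = max(c5) if c5 else None
        beta7 = max(c7) if c7 else None
        # Relations derived in the text: one parameter equals n, and the two
        # parameters differ.
        assert n in (beta5, beta7)
        assert beta5 != beta7
        # Final quast: S5 at iteration n when beta5 = n, S7 at n otherwise.
        predicted = ("S5", n) if beta5 == n else ("S7", n)
        # Reference: the assignment that really wrote s last.
        actual = ("S5", n) if outcome[n - 1] else ("S7", n)
        assert predicted == actual
        observed.add(actual)
    return observed
```

The returned set contains exactly the two operations at iteration n, one for
each branch of the test.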



3.4 Eliminating Parameters 

The result of the above computation can be considered as a parametric
representation of the fuzzy source: as the parameters take all possible
values, the result visits all possible sources. In some cases, this is exactly
what is needed for further analysis. In most cases, however, more compact
representations are enough. This can be obtained by the following process.

Let $\varsigma(\beta)$ be a leaf of the fuzzy source, where $\beta$ symbolizes
all parameters occurring in the leaf. Parameter elimination uses the two
rules:

Rule 6. A leaf $\varsigma(\beta)$ in context $C$ is replaced by the set:

$$\{\varsigma(\beta) \mid C(\beta)\}.$$

Note that after application of this rule, the variables of $\beta$ become
bound and no longer occur in the result.

Rule 7. A conditional if $p(\beta)$ then A else B, where A and B are sets
which do not depend on $\beta$, is replaced by $A \cup B$.

Application of these rules to the result of the analysis of E3 gives the fuzzy
source:

$$\{\langle S_5, n\rangle, \langle S_7, n\rangle\}.$$

Observe that rules 6 and 7 are consistent with rule 4. If the context of a
leaf is unfeasible, the leaf can be removed by rule 4. It can also be
transformed into the empty set by rule 6, and it will then disappear at the
next application of rule 7.
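A minimal sketch of parameter elimination follows, under strong simplifying
assumptions: the parameters range over a small finite domain that can be
enumerated, quasts are encoded as nested tuples, and contexts and predicates
are ordinary Python functions. None of these encodings comes from the text;
they only make rules 6 and 7 executable on the E3 example.

```python
from itertools import product


def eliminate(quast, context, domain):
    """Turn a parametric quast into a set of sources.

    quast is ("leaf", value_fn) or ("if", pred_fn, then_quast, else_quast);
    context and pred_fn take a parameter tuple and return a boolean.
    """
    if quast[0] == "leaf":
        # Rule 6: the leaf becomes the set of its values over its context.
        _, value = quast
        return {value(beta) for beta in domain if context(beta)}
    # Rule 7: a conditional between two sets becomes their union.
    _, pred, then_q, else_q = quast
    left = eliminate(then_q, lambda b: context(b) and pred(b), domain)
    right = eliminate(else_q, lambda b: context(b) and not pred(b), domain)
    return left | right


n = 4
values = [None] + list(range(1, n + 1))        # None stands for "undefined"
domain = list(product(values, repeat=2))       # all pairs (beta5, beta7)
relations = lambda b: n in b and b[0] != b[1]  # relations found for E3
quast = ("if", lambda b: b[0] == n,
         ("leaf", lambda b: ("S5", n)),
         ("leaf", lambda b: ("S7", n)))
fuzzy_source = eliminate(quast, relations, domain)
```

A leaf whose context is unfeasible yields the empty set, which then vanishes
in the union, in agreement with the remark on rule 4.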
3.5 Related Work

3.5.1 Pugh and Wonnacott's Method. Pugh and Wonnacott [Won95]
have extended the Omega calculator for handling uninterpreted functions in
logical formulas. This allows them to formulate problems of dataflow analysis
in the presence of intractable constructs. They simply introduce a function
to represent the value of the construct as a function of the surrounding loop
counters. These functions may be used to represent the number of iterations
of a while loop (see the analysis of example E1 in Sect. 3.1) or the
outcome of a test (see b for example E3 in the same section). When we say
that a construct has depth $p_i$, it means that the corresponding function has
as arguments the $p_i$ outermost loop counters.

The problem with this approach is that adding one uninterpreted function
to Presburger logic renders it undecidable. Hence, Pugh and Wonnacott have
to enforce restrictions to stay within the limits of decidability. They have
chosen to partition the variables in a logical formula into input and output
variables, and to use only uninterpreted functions which depend either on
the input or output variables but not both. Applying a function to anything
else (e.g. a bound variable inside an inner quantifier) is forbidden and is
replaced by the uninterpreted symbol unknown. This restriction is rather ad
hoc and it is difficult to assess its effect on the power of Pugh and
Wonnacott's system. In fact, we know of several examples which they cannot
handle but which can be solved by FADA: E3 is a case in point. In the case of
FADA, D. Barthou et al. have proved in [BCF97] that their system of relations
between parameters of the maximum is correct and complete, i.e. that no
potential source is missed, and that each element of a source set can be a
source for some realization of the intractable predicates.

On the other hand, Pugh and Wonnacott have included some semantical
knowledge in their system. When assigning functions to intractable
constructs, they identify cases in which two constructs are equal and assign
them the same function. This is easily done by first converting the source
program to Static Single Assignment (SSA) form. In SSA form, syntactically
identical expressions are semantically equal. The detection of equal
expressions is limited to one basic block. This method allows them to handle
examples such as:

program E4;
begin
  for i := 1 to n do
  begin
    if p(i) >= 0 then
1:    s := ...;
    if p(i) < 0 then
2:    s := ...;
  end;
  ... := s ...;
end
in which the key to the solution is recognizing that p(i) has the same value
in the two tests. We could have introduced a similar device in FADA; the
result of the analysis could have been translated in terms of the $C_i$ sets
(here, we would have got the same relations as in the case of E3) and the
analysis would have then proceeded as above. We have chosen to handle first
the semantical part of FADA. Recognizing equal and related expressions is left
for future work, and we intend to do it with more powerful devices than SSA
conversion (see [BCF97]).

3.5.2 Abstract Interpretation. As is well known, in denotational semantics,
the aim is to build the input/output function of a program, which gives
the final state of the computer memory in terms of its initial state. This
function is built in terms of simpler functions, which give the effect of each
statement and the value of each expression. These functions in turn are
obtained by applying compilation functions to abstractions of the program
text. The definitions of the compilation functions are constructive enough to
enable a suitable interpreter to execute them. As many researchers have
observed, these function definitions are quite similar to ML programs.

The basic idea of abstract interpretation [CC77] is to define other,
nonstandard semantical functions. Obviously, this is interesting only if a
nonstandard semantics can be considered in some sense as an approximation of
the standard semantics. This is formalized using the concept of Galois
connection between the domains of the abstract and standard semantics.

An example of the use of these ideas for analysis of array accesses is found
in [CI96]. In this work, the results of the analysis are regions, i.e. subsets
of arrays as defined by constraints on their subscripts [TIF86]. Several types
of regions are defined. For instance, the write region of a statement is the
set of array cells which may be modified when the statement is executed. The
IN region is the set of array cells whose contents are used in a calculation.

When designing such an analysis, one has to select a finite representation
for regions. In the quoted work, regions are convex polyhedra in the
subscript space. Less precise representations have been suggested, see for
instance [GLL95] for the concept of regular sections. In the same way as the
standard semantics has operators to act on arrays, the nonstandard semantics
must have operators to act on regions. These operators are intersection,
union, subtraction and projection (which is used for computing the effect
of loops). Depending on the representation chosen, these operators may be
closed or not. For instance, intersections of polyhedra are polyhedra, but
unions are not. In case of unclosed operators, one has to define a closed
approximation: for the union of polyhedra, one usually takes the convex hull
of the real union.

One sees that there are two sources of approximation in region analysis.
One comes from the choice of region representation. For instance, convex
polyhedra are more precise than regular sections, but are not precise enough
to represent frequently occurring patterns, like:

      do i = 1,n
         m(2*i-1) = 0.
      end do

The corresponding write region, in [CI96] notation, is
$\{m(\varphi) \mid 1 \le \varphi \le 2n - 1\}$, which is only an approximation
of the exact region,
$\{m(\varphi) \mid \varphi = 2\psi - 1,\ 1 \le \psi \le n\}$.
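The loss of precision can be made concrete by enumerating both regions for a
small n. In this sketch the convex region is taken to be the smallest
interval containing the exact one, which is what a one-dimensional convex
polyhedron amounts to; the function names are assumptions of the example.

```python
def exact_write_region(n):
    """Subscripts actually written by m(2*i-1) = 0 for i = 1..n."""
    return {2 * i - 1 for i in range(1, n + 1)}


def convex_write_region(n):
    """Smallest interval of subscripts containing the exact region."""
    return set(range(1, 2 * n)) if n >= 1 else set()


def spurious_cells(n):
    """Cells reported as written by the convex region but never written."""
    return convex_write_region(n) - exact_write_region(n)
```

For n = 4 the convex region reports the even subscripts 2, 4 and 6 as
possibly written, although the loop never touches them.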

The second source of approximation is the same as the one in FADA: the
source program may contain intractable constructs. Approximate regions are
constructed by ignoring intractable terms, in the spirit of (3.2).

ADA and FADA represent their results not as convex polyhedra but as
finite unions of Z-polyhedra (the intersection of a polyhedron and a
Z-module, see the appendix). This representation is inherently more precise
and has enough power to represent exactly all regions occurring in the
analysis of static control programs. An interesting open problem is the
following: is it possible to reformulate the method of [CI96] in terms of
unions of Z-polyhedra, and, if so, would the results be more or less precise
than FADA?



4. Analysis of Complex Statements 

4.1 What Is a Complex Statement 

All the preceding analyses are predicated on the hypothesis that each
operation modifies at most one memory cell. It is not difficult to see that
they can be easily extended to cases where an operation modifies a statically
bounded set of memory cells.

The situation is more complicated when the language allows the modification
of an unbounded set of memory cells by one statement. A case in point is the
read statement in Fortran:

      program R
      do i = 1,n
         read (*,*) (a(i,j), j = 1,n)
      end do

Another example is parallel array assignments in languages like Fortran 90,
the Perfect Club Fortran (PCF) or HPF. The simplest case is that of the
independent do loop:

      program Z
      doall (i = 1:n)
         a(i) = 0.0
      end doall

Program Z is in PCF notation, while ZV is in Fortran 90 vector notation.

      program ZV
      a(1:n) = 0.0
How are we to handle such idioms in Array Dataflow Analysis? Let us
recall the definition (2.7) of the set of candidate sources:

$$Q_i^p(b) = \{a \mid E_i a + n_i \ge 0,\ a \prec_p b,\ f_i(a) = g(b)\}.$$

The first problem for a complex statement is that $a$ no longer characterizes
the values which are created when executing operation $a$. We have to
introduce auxiliary or inner variables to identify each value. In the case
of program R, for instance, this new variable is in evidence; it is simply the
implicit do loop counter j. The same is true for program Z. In the case of
program ZV, a new counter has to be introduced, let us call it $\varphi$.

We next have to decide what constraints must be satisfied by these inner
variables. For the three examples above, these are in evidence from the
program text:

$$1 \le j \le n$$

for R, and:

$$1 \le \varphi \le n$$

for ZV. Objects like:

$$\{a[\varphi] \mid 1 \le \varphi \le n\},$$

composed of an array and a subset of the index space of the array, are the
regions of [CI96]. We will use here generalized regions, of the form:

$$\{M[f(a, \varphi)] \mid A a + B \varphi + m \ge 0\},$$

where $f$ is affine, $A$ and $B$ are constant matrices, $m$ is a constant
vector, and $a$ is the vector of the outer variables.

As to the sequencing predicate in (2.7), it stays the same whatever the
type of the candidate statement, since we are supposing here that the
corresponding operation is executed in one step. There is, however, a problem
with the computation of the latest source, i.e. with the maximum of the
candidate set, whose new form is:

$$Q_i^p(b) = \{a, \varphi \mid E_i a + n_i \ge 0,\ A_i a + B_i \varphi + m_i \ge 0,\ a \prec_p b,\ f_i(a, \varphi) = g(b)\}.$$

We know that sources belonging to different iterations are executed according
to lexicographic order, but what of sources belonging to the same iteration?
There are several possible situations here.

In the simplest case, that of examples Z and ZV, the rules of the language
ensure that there cannot be an output dependence in the doall loop or in
the vector assignment. This means that $\varphi$ is uniquely determined by the
subscript equations whenever $a$ and $b$ are known. Hence, there will never be
a comparison between sources at the same iteration: we can use any convenient
order on the components of $\varphi$, lexicographic order for instance.

In the case of example R there is no such condition on the implicit do
loops. But, fortunately, the language definition stipulates that these loops
are executed in the ordinary way, i.e. in order of lexicographically
increasing $\varphi$, as above.
4.2 ADA in the Presence of Complex Statements

To summarize the preceding discussion, in the presence of complex
statements, the first step is the determination of read and modified regions.
The usefulness of modified regions is obvious. Read regions delimit the set of
memory cells for which sources are to be calculated; their inner variables
are simply added as new parameters to the coordinates of the observation
statement. In the simple cases we have already examined, the regions can
be extracted from a syntactical analysis of the program text. See the next
section for a more complicated case.

The analysis then proceeds as in the case of simple statements, the inner
variables being considered as "virtual" loop counters (which they are in
examples R and Z). The corresponding components are then eliminated or
kept, depending on the application, and the direct dependences are combined
as above.

4.3 Procedure Calls as Complex Statements 

A procedure or function call is a complex statement as soon as one of its
arguments is an array, provided the procedure or function can modify it.
This is always possible in Fortran or C. In Pascal, the array argument has to
be called by reference. In contrast with the previous examples, one does not
know beforehand which parts of which arguments are going to be modified.
This information can only be obtained by an analysis of the procedure body
itself.

4.3.1 Computing the Input and Output Regions of a Procedure.

The case of the output region is the simplest. A cell is modified as soon as
there is an assignment to it in the code. Consider the following assignment
statement:

    for a := ...
      M[f(a)] := ...

The associated region is simply:

$$\{M[f(a)] \mid E a + n \ge 0\}.$$

The constraints of the region are given by the bounds of the surrounding
loops.

We have to collect all such subregions for a given argument. The result
may have redundancies whenever a memory cell is written into several times.
This redundancy is harmless, since the write order is not significant outside
the procedure. It may however be a source of inefficiency. It can be removed
either by polyhedra handling methods or by the following systematic
procedure. Suppose we add at the end of the procedure a fictitious observation
operation for each cell of each argument, and that we compute the
corresponding source. The result is a quast which depends on the subscripts of
the array cell, $\varphi$. For each leaf whose value is not $\bot$ we may
construct a subregion:

$$\{M[\varphi] \mid C(\varphi)\},$$

where $C$ is the context of the distinguished leaf. The result will have no
redundancy.

The computation of the input region is more difficult. Notice first that it
is not the same thing as the read region, as shown by the elementary example:

1:  x := ...;
2:  ... := x;

x is read but it is not in the input region, since its entry value is killed
by $S_1$. Computing the input region as accurately as possible is important,
since a source is to be computed for each of its cells in the calling routine.
Redundancies will induce useless computation; inaccuracies generate spurious
dependences and lessen parallelism. The solution is to compute the earliest
access to each cell of each argument of the procedure. One collects all
accesses to a cell in the body of the procedure, whether reads or writes. This
gives a set of candidates, of which one computes the lexicographic minimum
using the same technology as in the source computation. The resulting quast
gives the earliest access to each argument cell as a function of its
subscripts. If the access is a read, the cell is in the input region. If it is
a write, it is not. Lastly, if the leaf is $\bot$, then the cell is not used
in the procedure. Subregions of the input are associated to read leaves in the
quast, and are constructed in the same way as in the case of the output
region.
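The earliest-access criterion can be illustrated on an explicit execution
trace. This is only a sketch: it enumerates accesses in execution order
instead of computing a lexicographic minimum symbolically, and the trace
encoding (pairs of an access kind and a cell) is an assumption of the
example. Within one operation, reads are listed before the write.

```python
def input_region(trace):
    """Cells whose earliest access in the trace is a read.

    trace: iterable of (kind, cell) in execution order, kind is "R" or "W".
    """
    first_access = {}
    for kind, cell in trace:
        first_access.setdefault(cell, kind)  # keep only the earliest access
    return {cell for cell, kind in first_access.items() if kind == "R"}


def body_trace(n):
    """Accesses of a hypothetical body: for i := 1 to n do a[i] := a[i-1] + b[i]."""
    for i in range(1, n + 1):
        yield ("R", ("a", i - 1))
        yield ("R", ("b", i))
        yield ("W", ("a", i))
```

On this body, only a[0] and the cells of b belong to the input region: every
other cell of a is written before it is read.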

If the procedure is not a static control program, we have to use techniques
from FADA when computing the input and output regions. Fuzziness in the
input region is not important. It simply means the loss of some parallelism.
Fuzziness in the output region is more critical, and may preclude Dataflow
Analysis of the calling routine, for reasons which have been explained above
(see Sect. 3.1).

This analysis gives the input and output regions of a procedure, but they
are expressed in terms of the procedure formal arguments. To proceed to the
dataflow analysis of the calling routine, we have to translate these regions
in terms of the caller variables, i.e. in terms of the actual arguments of the
procedure. This translation is easy in Pascal, since actual and formal
parameters must have exactly the same type: one has simply to change the name
of formal arrays to actual arrays in each subregion. In the case of Fortran or
C, where the typing is less strict, one has to exhibit the addressing function
(or linearization function) of the formal and actual arrays. The relation
between actual and formal subscripts is obtained by writing that the two array
accesses are to the same memory cell, and that the subscripts are within the
array bounds. In simple cases, one may find closed formulas expressing one
of the subscript sets in terms of the other. If the bounds are explicitly
given numbers, the problem can be solved by ILP. There remains the case of
symbolic array bounds, in which one has to resort to ad hoc methods which are
not guaranteed to succeed [CI96].

Footnote on the fictitious observation operation of Sect. 4.3.1: e.g., a print
statement.

Footnote on the computation of the earliest access: Note that there is a
subtle point in the use of rule 3 for this problem. We may have to compare an
operation to itself, if it includes both a read and a write to the same cell.
Obviously, the read always occurs before the write. In the line:

    s := s + 1;

the read of s occurs before the write, hence s is in the input region.

4.3.2 Organizing the Analysis. In Fortran, procedures cannot be recursive.
Hence, one may draw a call tree. The interprocedural dataflow analysis
can be done bottom up. Leaves call no procedure, hence their regions can
be calculated without difficulty. If the input and output regions of all
called procedures are known, then the input and output regions of the caller
can be computed. When all input and output regions are known, then array
dataflow analysis can be executed independently for all procedures.
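The bottom-up organization amounts to processing procedures in a topological
order of the call graph, callees first. The sketch below uses the standard
`graphlib` module of Python (version 3.9 or later); the dictionary encoding
of the call graph and the procedure names are assumptions of the example.

```python
from graphlib import CycleError, TopologicalSorter


def bottom_up_order(calls):
    """Order procedures so that every callee comes before its callers.

    calls maps each procedure name to the set of procedures it calls.
    Raises CycleError if the call graph is recursive.
    """
    return list(TopologicalSorter(calls).static_order())


# Hypothetical program: main calls f and g, which both call h.
calls = {"main": {"f", "g"}, "f": {"h"}, "g": {"h"}, "h": set()}
order = bottom_up_order(calls)
```

Regions are then computed for each procedure in this order, so that the
regions of all callees are available when a caller is analyzed. A recursive
program makes the ordering fail, which matches the restriction stated in the
text.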

Input and output regions can be stored in a library, along with other
information about the procedure, such as its type and the type of its
arguments. Care should be taken, however, that the region information is not
intrinsic to the procedure and has to be computed again whenever the procedure
or one of the procedures it calls (directly or indirectly) is modified.

Input and output region calculation for recursive procedures is an open
problem. It is probably possible to set it up as a fixpoint calculation, but
all technical details (monotonicity, convergence, complexity, ...) are yet to
be studied.

[CI96] gives another method for computing input and output regions.
Regions are approximated by convex polyhedra, and dataflow equations are
used to propagate regions through the program structure. The overall
organization of the computations is the same as the one given here.



5. Applications of ADA and FADA

All applications of ADA and FADA derive from two facts:

- The method is static: it can be used at compile time, without any
  knowledge besides the program text.

- The result is a closed representation of a dynamic phenomenon: the creation
  and use of values as the execution of the program proceeds.

One may in fact consider that the dataflow of a program is one possible
representation of its semantics. If this is stipulated, then ADA is a way of
extracting a semantics from a program text. FADA gives the same information,
but with lesser precision. Hence, ADA and FADA are useful as soon as one
needs to go beyond the "word for word" or "sentence for sentence" translation
that is done by most ordinary compilers. Cases in point are program
understanding and debugging, all kinds of optimization including
parallelization, and especially array expansion and privatization.



5.1 Program Comprehension and Debugging 

A very simple application of ADA and FADA is the detection of uninitialized
variables. Each occurrence of a $\bot$ in a source indicates that a memory
cell is read but that there is no write to this cell before the read. If we
are given a complete program, this clearly suggests a programming error. The
program has to be complete: it should include all statements which can set
the value of a variable, including read statements, initializations, and even
hidden initialization by, e.g., the underlying operating system. Note that the
presence of a $\bot$ in a source is not absolute proof of an error. For
instance, in:

    x := y * z;

y may be uninitialized if one is sure that z is zero. In the case of ADA,
the access to an uninitialized variable may be conditional on the values of
structure parameters. An example is:

      do i = 1,n
1        s = ...
      end do
2     x = s

The source of s in $S_2$ is if $n \ge 1$ then $\langle S_1, n\rangle$ else
$\bot$. There is an error if $n < 1$. This situation may be explicitly
forbidden by the program definition, or, better, by a test on n just after its
defining statement. One may use any number of techniques to propagate the test
information through the program (see e.g. [JF90]) and use it to improve the
analysis.

The situation is more complicated for FADA. The presence of a \ in a 
source indicates that, for some choice of the intractable predicates, an access 
to an uninitialized variable may occur. But this situation may be forbidden 
by facts about the intractable predicates that we know nothing about, or that 
we are not clever enough to deduce from the program text. In this situation, 
one should either shift to more precise analyses (for instance use semantical 
knowledge), or use program proving techniques to show that the error never 
occurs. 

The same technology which is used for ADA can be reused for checking 
the correctness of array accesses. The results take the form of conditions 
on the structure parameters for the subscripts to be within bounds. These 
conditions can be tested once and for all as soon as the values of structure 
parameters are known, giving a far more efficient procedure than the run-time
tests which are generated by some Pascal compilers. 

The knowledge of exact sources allows the translation of a program into
a system of recurrence equations (SRE):

$$\forall a \in D_i : v_i[a] = E(\ldots, v_k[f_{ik}(a)], \ldots), \quad i = 1, \ldots, n \qquad (5.1)$$

where $D_i$ is the domain of the equation (a set of integer vectors), $v_i$ and
$v_k$ are variables (functions from integer vectors to an unspecified set of
values), and the $f_{ik}$ are dependence functions. $E$ is an arbitrary
expression, most of the time a conditional. Systems of recurrence equations
were introduced in [KMW67]. Concrete representations of such systems are used
as the starting point of systolic array synthesis (see for instance [LM 91]).

To transform a static control program into an SRE, first assign a distinct
variable to each assignment statement. The domain of $v_i$ associated
to statement $S_i$ is the iteration domain of $S_i$, and the left hand side of
the corresponding equation is simply $v_i[a]$ where $a$ scans $D_i$. The
expression $E$ is the right hand side of the assignment, where each memory cell
is replaced by its source. If some of the sources include $\bot$, the original
arrays of the source program are to be kept as non mutable variables and each
$\bot$ is to be converted back to the original reference (see [Fea91] for
details).

As an example of semantics extraction, consider the program piece:

    for i := 1 to n do m[i] := m[i+1]

The source of m[i+1] is $\bot$. The equivalent SRE is:

$$v[i] = m[i+1], \quad i = 1, \ldots, n.$$

In the case of:

    for i := 1 to n do m[i] := m[i-1]

the source of m[i-1] is if $i > 1$ then $\langle 1, i-1 \rangle$ else $\bot$.
The equivalent SRE is:

$$v[i] = \textbf{if } i > 1 \textbf{ then } v[i-1] \textbf{ else } m[i-1].$$

The first recurrence clearly represents a left shift while the second one is a
rightward propagation of m[0].
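
The two examples above can be checked by brute force. The following Python sketch is an illustration only: the function names and the test data are assumptions, not part of the original text. It runs each loop imperatively and evaluates the corresponding SRE, so that the results can be compared.

```python
def run_imperative(m, n, shift):
    """Execute `for i := 1 to n do m[i] := m[i+shift]` on a copy of m."""
    m = list(m)
    for i in range(1, n + 1):
        m[i] = m[i + shift]
    return m[1:n + 1]


def run_sre_left(m, n):
    # v[i] = m[i+1], i = 1..n: every source is undefined, so only the
    # original (non mutable) array m is read.
    return [m[i + 1] for i in range(1, n + 1)]


def run_sre_right(m, n):
    # v[i] = if i > 1 then v[i-1] else m[i-1]
    v = {}
    for i in range(1, n + 1):
        v[i] = v[i - 1] if i > 1 else m[i - 1]
    return [v[i] for i in range(1, n + 1)]


n = 4
m = [10, 11, 12, 13, 14, 15]   # cells m[0] .. m[n+1]
left = run_sre_left(m, n)      # left shift of m
right = run_sre_right(m, n)    # m[0] propagated to the right
```

In both cases the SRE reads only values that are never overwritten, which is what makes it usable as a single assignment form.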

An SRE is a mathematical object which can be submitted to ordinary
reasoning and transformations. One can say that an SRE conveys the
semantics of the source program, and ADA in this sense is a semantics
extractor. The process can be pursued one step further by recognizing scans
and reductions [RF93].

One can also think of an SRE as a (Dynamic) Single Assignment program.
Most of the time, the memory needs of a DSA program are prohibitive. It is
best to think of a DSA program (or of the associated SRE) as an intermediate
step in the compilation process.

The results of FADA are to be thought of as an approximate semantics.
It is much more difficult to convert them into something approaching a well
defined mathematical object. One has to resort to dynamically gathered
information to select the real source among the elements of a source set. The
reader is referred to [GC95] for details.

5.2 Parallelization 

The main use of source information is in the construction of parallel
programs. Two operations in a program are in (data) dependence if they share a
memory cell and one of them at least modifies it. Dependent operations must
be executed sequentially. Other operations can be executed concurrently.
Dependences are classified as flow dependences, in which a value is stored for
later use, and anti- and output dependences, which are related to the sharing
of a memory cell by two unrelated values. The latter type of dependence can
be removed by data expansion, while flow dependences are inherent to the
underlying algorithm. It follows that maximum parallelism is obtained by
taking into account the source relation only: an operation must always be
executed after all its sources.

These indications can be formalized by computing a schedule, i.e. a
function which gives the execution date of each operation in the program. All
operations which are scheduled at the same time can be executed in parallel.
For reasons which are too complicated to explain here (see [Fea89]), one does
not have to use the exact execution time of each operation when computing a
schedule, provided the amount of parallelism is large. One may suppose that
all operations take unit time. The schedule $\theta$ must then satisfy:

$$\theta(u) \ge \theta(\sigma(u)) + 1 \qquad (5.2)$$

for all operations u in the case of ADA, and

$$\forall v \in \sigma(u) : \theta(u) \ge \theta(v) + 1 \qquad (5.3)$$

in the case of FADA. These systems of functional inequalities usually have
many solutions. For reasons of expediency, one selects a particular type of
solution (in fact, the solutions which are affine functions of the loop
counters) and solves either (5.2) or (5.3) by linear programming. The reader is
referred to [Fea92a, Fea92b] for details of the solution method.

Some programs do not have affine schedules, i.e. the associated linear
programming problem proves infeasible. In that case, one must resort to
multidimensional schedules, in which the value of $\theta$ is a d dimensional
vector. The execution order is taken to be lexicographic order. Suppose we are
dealing with an N-deep loop nest. Having a d dimensional schedule means that
the parallel program has d sequential loops enclosing N - d parallel loops.
Using such schedules may be necessary because the source program has a
limited amount of parallelism, or because we are using overestimates of the
dependences from FADA, or simply because we want to adapt a schedule to
a parallel processor by artificially reducing the amount of parallelism.

5.3 Array Expansion and Array Privatization 

It is easy to see that the degree of parallelism of a program is closely related 
to the size of its working memory (the part of its memory space the pro- 
gram can write into), since independent operations must write into distinct 
memory cells. Consider a loop nest that we hope to execute on P processors. 
This is only possible if the nest uses at least P cells of working memory. 
Parallelization may thus be prevented by too small a memory space, since 
programmers have a natural tendency to optimize memory. A contrario, a 
parallelizer may have to enlarge the working memory to obtain an efficient
parallel program. 

This can be done in two ways. Consider for instance the kernel of a matrix- 
vector code: 

    for i := 1 to n do
    begin
1:      s := 0;
        for j := 1 to n do
2:          s := s + a[i,j]*x[j]
    end

The working memory is s of size one, hence the program is sequential. The
first possibility is to privatize s, i.e. to provide one copy of s per
processor. How do we know that this transformation is allowed? Observe that our
objective here is to find parallel loops. If the i loop, for instance, is
parallelized, distinct iterations may be executed by distinct processors. Since
each processor has its copy of s, this means there must not be any exchange of
information through s between iterations of the i loop. The same is true
for the j loop if we decide to parallelize it. Now consider the source of s in
statement 2. It is easily computed to give:

$$\sigma(\langle 2, i, j \rangle) = \textbf{if } j > 1 \textbf{ then } \langle 2, i, j-1 \rangle \textbf{ else } \langle 1, i \rangle.$$

It is clear that there is no information flow from iteration i to an
iteration $i' \ne i$. There is, on the contrary, a data flow from iteration
j - 1 to iteration j. This shows both that the i loop is parallel and that s
must be privatized, giving:

    forall i := 1 to n do
    begin
        s : private real;
1:      s := 0;
        for j := 1 to n do
2:          s := s + a[i,j]*x[j]
    end

This method generalizes to array privatization. For another approach, see
[TP94].

There is however another method, which is to resort to array expansion
instead of array privatization. The first idea that comes to mind is to use the
Dynamic Single Assignment version of the program, thus ensuring that all
output dependences are satisfied. The result in the above case is:

    forall i := 1 to n do
    begin
1:      s1[i] := 0;
        for j := 1 to n do
2:          s2[i,j] := (if j > 1
                        then s2[i,j-1]
                        else s1[i]) + a[i,j]*x[j]
    end

Notice however that while the original memory size was O(1), it is now O(n^2),
the amount of parallelism being only O(n). The degree of expansion is clearly
too large. It is possible, by analyzing the life span of each value in the
program, to find the minimum expansion for a given schedule [LF97]. In the
present case, one finds:

    forall i := 1 to n do
    begin
1:      s[i] := 0;
        for j := 1 to n do
2:          s[i] := s[i] + a[i,j]*x[j]
    end

Suppose we are using P processors. This may still be too much if n is much
larger than P, as it should be for efficiency's sake. The solution is to adjust
the schedule for the right amount of parallelism. The optimal schedule is
$\theta(\langle 2, i, j \rangle) = j$, which should be replaced by the two
dimensional version:

Hh T 2' F 1) = . > 

' ' i mod P 

The resulting program is (for simplicity, we suppose that P divides n):

    forall ii := 1 to P do
        for k := ii to n by P do
        begin
1:          s[ii] := 0;
            for j := 1 to n do
2:              s[ii] := s[ii] + a[k,j]*x[j]
        end

The amount of expansion is now exactly equal to the amount of parallelism.
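
The following Python sketch emulates the P-processor version sequentially and compares it with a direct matrix-vector product. It is an illustration under stated assumptions: indices are 0-based, the `forall` loop is run as an ordinary loop, and the `result` array is a hypothetical addition that captures each value of s before the cell is reused (the program in the text does not store it).

```python
def matvec_reference(a, x):
    """Direct computation of the n row sums."""
    n = len(x)
    return [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


def matvec_expanded(a, x, P):
    """Emulate the program with s expanded to P cells, one per processor."""
    n = len(x)
    s = [0] * P
    result = [None] * n               # hypothetical: keeps s for each row k
    for ii in range(P):               # forall: iterations touch distinct s[ii]
        for k in range(ii, n, P):     # rows handled in sequence by processor ii
            s[ii] = 0
            for j in range(n):
                s[ii] += a[k][j] * x[j]
            result[k] = s[ii]
    return result


a = [[i + j for j in range(4)] for i in range(4)]
x = [1, 2, 3, 4]
expected = matvec_reference(a, x)
```

Since iteration ii of the outer loop writes only into `s[ii]`, the outer iterations could be run in any order, or concurrently, without changing `result`.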




214 Paul Feautrier 



6. Conclusions 

Let us take a look at what has been achieved so far. We have presented a
technique for extracting semantics information from sequential imperative
programs at compile time. The information we get is exact and, in fact,
exhaustive in the case of static control programs. In the case of less regular
programs, we get approximate results, the degree of approximation being in
direct proportion to the irregularity of the source code.

Array Dataflow information has many uses, some of which have been
presented here, in program analysis and checking, program optimization and
program parallelization. There are other applications, some of which have not
been reported here due to lack of space [RF93], while others are still awaiting
further developments: consider for instance the problem of improving the
locality of sequential and parallel codes.

There are still many problems in the design of Array Dataflow Analyses.
For instance, what is the relation between FADA and methods based on Abstract
Interpretation? What is the best way of taking advantage of semantic
information about the source program? Can we extend Dataflow Analysis to
other data structures, e.g. trees? All these questions will be the subject of
future research.



A. Appendix : Mathematical Tools 

The basic reference on linear inequalities in rationals or integers is the
treatise [Sch86].

A.1 Polyhedra and Polytopes

There are two ways of defining a polyhedron. The simplest one is to give a
set of linear inequalities:

$$Ax + a \ge 0.$$

The polyhedron is the set of all x which satisfy these inequalities. A
polyhedron can be empty (the set of defining inequalities is then said to be
infeasible) or unbounded. A bounded polyhedron is called a polytope.

The basic property of a polyhedron is convexity: if two points a and b
belong to a polyhedron, then so do all convex combinations
$\lambda a + (1 - \lambda) b$, $0 \le \lambda \le 1$. Conversely, it can be
shown that any polyhedron can be generated by convex combinations of a finite
set of points, some of which (rays) may be at infinity. Any polyhedron is
generated by a minimal set of vertices and rays.

There exist non-polynomial algorithms for going from a representation by
inequalities to a representation by vertices and rays and vice-versa. Each
representation has its merits: for instance, inequalities are better for
constructing intersections, while vertices are better for convex unions.

The basic algorithms for handling polyhedra are feasibility tests: the
Fourier-Motzkin cross-elimination method [Fou90] and the Simplex [Dan63]. The
interested reader is referred to the above quoted treatise of Schrijver for
details. Both algorithms prove that the object polyhedron is empty, or exhibit
a point which belongs to it. For definiteness, this point often is the
lexicographic minimum of the polyhedron. In the case of the Fourier-Motzkin
algorithm, the construction of the exhibit point is a well separated phase
which is omitted in most cases.

Both the Fourier-Motzkin and the Simplex are variants of the Gaussian
elimination scheme, with different rules for selecting the pivot row and
column. Theoretical results and experience have shown that the Fourier-Motzkin
algorithm is faster for small problems (less than about 10 inequalities), while
the Simplex is better for larger problems.
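
To make the cross-elimination step concrete, here is a small Python sketch of a Fourier-Motzkin feasibility test over the rationals. It is a textbook-style illustration, not the implementation used by the author: the representation of an inequality as a pair `(c, c0)` meaning `sum(c[i]*x[i]) + c0 >= 0`, and the function names, are assumptions. No redundancy elimination is attempted, and the construction of the exhibit point is omitted.

```python
from fractions import Fraction


def eliminate(ineqs, k):
    """Eliminate variable k from inequalities (c, c0): sum(c[i]*x[i]) + c0 >= 0."""
    lower, upper, rest = [], [], []
    for c, c0 in ineqs:
        if c[k] > 0:
            lower.append((c, c0))     # yields a lower bound on x[k]
        elif c[k] < 0:
            upper.append((c, c0))     # yields an upper bound on x[k]
        else:
            rest.append((c, c0))
    for lc, l0 in lower:
        for uc, u0 in upper:
            # Positive combination of the two rows that cancels x[k].
            a, b = -uc[k], lc[k]
            combo = [a * p + b * q for p, q in zip(lc, uc)]
            rest.append((combo, a * l0 + b * u0))
    return rest


def is_feasible(ineqs, nvars):
    """Rational feasibility test: eliminate the variables from last to first."""
    system = [([Fraction(v) for v in c], Fraction(c0)) for c, c0 in ineqs]
    for k in reversed(range(nvars)):
        system = eliminate(system, k)
    # Only constant inequalities `c0 >= 0` remain.
    return all(c0 >= 0 for _, c0 in system)


# 1 <= x <= y <= 3, written as x - 1 >= 0, y - x >= 0, 3 - y >= 0.
triangle = [([1, 0], -1), ([-1, 1], 0), ([0, -1], 3)]
# x >= 2 and x <= 1 have no common solution.
empty = [([1], -2), ([-1], 1)]
```

Each elimination may square the number of inequalities, which is consistent with the remark above that the method is best suited to small problems.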

A.2 Z-modules

Let $v_1, \ldots, v_n$ be a set of linearly independent vectors of $Z^n$ with
integral components. The set:

$$\mathcal{L}(v_1, \ldots, v_n) = \{\mu_1 v_1 + \cdots + \mu_n v_n \mid \mu_i \in Z\}$$

is the Z-module generated by $v_1, \ldots, v_n$. The set of all integral points
in $Z^n$ is the Z-module generated by the canonical basis vectors (the
canonical Z-module).

Any Z-module can be characterized by the square matrix V of which
$v_1, \ldots, v_n$ are the column vectors. However, many different matrices may
represent the same Z-module. A square matrix is said to be unimodular if it
has integral coefficients and if its determinant is $\pm 1$. Let U be a
unimodular matrix. It is easy to prove that V and VU generate the same lattice.

Conversely, it can be shown that any non-singular matrix V can be written
in the form V = HU where U is unimodular and H has the following
properties:

H is lower triangular.

All coefficients of H are positive.

The coefficients in the diagonal of H dominate coefficients in the same row.

H is the Hermite normal form of V. Two matrices generate the same Z-module
if they have the same Hermite normal form. The Hermite normal
form of a unimodular matrix is the identity matrix, which generates the
canonical Z-module.

Computing the Hermite normal form of an $n \times n$ matrix is of complexity
$O(n^3)$, provided that the integers generated in the process are of such size
that arithmetic operations can still be done in time O(1).

A.3 Z-polyhedra

A Z-polyhedron is the intersection of a Z-module and a polyhedron:

$$F = \{z \mid z \in \mathcal{L}(V),\ Az + a \ge 0\}.$$

If the context is clear, and if $\mathcal{L}(V)$ is the canonical Z-module
(V = I), it may be omitted in the definition.

The basic problem about Z-polyhedra is the question of their emptiness
or not. For canonical Z-polyhedra, this is the linear integer programming
question [Sch86, Min83]. Studies in static program analysis use either the
Omega test [Pug91], which is an extension of Fourier-Motzkin, or the Gomory
cut method, which is an extension of the Simplex [Gom63].

Both the Omega test and the Gomory cut method are inherently non
polynomial algorithms, since the integer programming problem is known to
be NP-complete.

A.4 Parametric Problems

A linear programming problem is parametric if some of its elements, e.g. the
coefficients of the constraint matrix or those of the economic function,
depend on parameters. In problems associated to parallelization, it so happens
that constraints are often linear with respect to parameters. In fact, most of
the time we are given a polyhedron P:




in which the variables have been partitioned in two sets, the unknowns: x,
and the parameters: y. Setting the values of the parameters to p is equivalent
to considering the intersection of P with the hyperplane y = p, which is also
a polyhedron. In a parametric problem, we have to find the lexicographic
minimum of this intersection as a function of p.

The Fourier-Motzkin method is naturally parametric in this sense. One
only has to eliminate the unknowns from the last component of x to the
first. When this is done, the remaining inequalities give the conditions that
the parameters must satisfy for the intersection to be non empty. If this
condition is verified, each unknown is set to its minimum possible value, i.e.
to the maximum of all its lower bounds. Let $Gy + c \ge 0$ be the resulting
inequalities after elimination of all unknowns. The parametric solution may
be written:

$$\min(P \cap \{y = p\}) = \textbf{if } Gp + c \ge 0 \textbf{ then } \begin{pmatrix} \max(f(p), \ldots, g(p)) \\ \vdots \\ \max(h(p), \ldots, k(p)) \end{pmatrix} \textbf{ else } \bot$$

where $\bot$ is the undefined value and the functions $f, \ldots, k$ are affine.

The simplex also relies on linear combinations of the constraint matrix
rows, which can be applied without difficulty in the parametric case. The only
difficulty lies in the choice of the pivot row, which is such that its constant
coefficient must be negative. Since this coefficient depends in general on the
parameters, its sign cannot be ascertained; the problem must be split in two,
with opposite hypotheses on this sign. These hypotheses are not independent;
each one restricts the possible values of the parameters, until inconsistent
hypotheses are encountered. At this point, the splitting process stops. By
climbing back the problem tree, one may reconstruct the solution in the form
of a multistage conditional. Parametric Gomory cuts can be constructed by
introducing new parameters which represent integer quotients. The reader is
referred to [Fea88b] for an implementation of these ideas in the Parametric
Integer Programming (PIP) algorithm.

Summary. Array data flow information plays an important role for successful
automatic parallelization of Fortran programs. This chapter proposes a powerful
symbolic array data flow summary scheme to support array privatization and loop
parallelization for programs with arbitrary control flow graphs and acyclic call
graphs. Our approach summarizes array access information interprocedurally,
using guarded array regions. The use of guards allows us to use the information
in IF conditions to do path-sensitive data flow summary and thereby to handle
difficult cases. We also provide a mechanism to overcome the disadvantages of
non-closed union and difference operations. This improves not only the
exactness of summaries, but also the efficiency of the summarizing procedure.
Our preliminary result on array privatization shows that our summary scheme is
fast and powerful.

Key words: Parallelizing compilers, array data flow analysis, interprocedural anal- 
ysis, array privatization, guarded array regions, symbolic analysis. 



1. Introduction 

Interprocedural analysis, i.e. analyzing programs across routine boundaries,
has been recognized as an essential part of program analysis. Parallelizing
compilers, as a tool to enhance program efficiency on parallel computers,
depend heavily upon this technique to be effective in practice.

Interprocedural analysis for scalars has been well studied. In contrast,
interprocedural analysis for array references is still an open issue. Currently,
there are two ways to extend program analysis across a call site. The first
and simpler way is to use inline expansion. The drawbacks of inlining are
twofold. First, even if the program size does not increase after inline
expansion, the loop body containing routine calls may grow dramatically. This
often results in much longer compile time and much larger consumption of
memory resources [21] because many compiler algorithms dealing with loops have
non-linear complexity with respect to the loop body's size. Because a routine
is analyzed each time it is inlined, duplicate analysis can also cause
parallelizing compilers to be inefficient. Second, some routines are not
amenable to inline expansion due to complicated array reshaping and must be
analyzed without inlining.

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 221-246, 2001. 

Springer-Verlag Berlin Heidelberg 2001 

These drawbacks of inlining often make a summary scheme a more desirable
alternative. Early works on summary schemes summarize the side effects of a
called routine with sets of array elements that are modified or used by routine
calls, called MOD sets and USE sets respectively. Data dependences involving
routine calls can be tested by intersecting these sets. Existing approaches can
be categorized according to methods of set representation. Convex regions [27,
28] and data access descriptors [2] define sets by a system of inequalities
and equalities, while bounded regular sections [5, 15] use range tuples to
represent sets. Even though bounded regular sections are less precise than
convex regions and data access descriptors, they are much simpler and are
easy to implement.

The commonalities of these previous methods are as follows. First, because
they are path-insensitive, IF conditions are not taken into account
when program branches are handled. Second, they are flow-insensitive, so
only MOD/USE sets of array elements, which are modified or used respectively,
are summarized. Third, they use a single descriptor (a single regular
section, a single convex region, etc.) to summarize multiple references to the
same array. Because union operations are not closed, approximations have to
be made in order to represent union results in a single convex region or a
single regular section. This causes the loss of summary information. Therefore,
these methods are insufficient for optimizations such as array privatization.

Recently, flow-sensitive summary, or array data flow summary, has been
a focus in the parallelizing compiler area. The most essential information in
array data flow summary is the upward exposed use (UE) set [7,12,14,17,29].
Our work [12] and that of M. Hall, et al. [14] use either a list of regular
sections or a list of convex regions to summarize each array in order to obtain
more precise information than that provided by a single regular section or
convex region. Our work is unique in its use of guarded array regions (GARs),
providing the conditions under which an array region is modified or upward
exposed. This is in contrast to path-insensitive summaries which do
not distinguish summary sets for different program paths. Compared to other
approaches [7] which handle more restricted cases of IF conditions, our
approach seems to be more general.

In this chapter, we will describe our array data flow summary based on
guarded array regions in the context of parallelizing compilers. The
techniques involved should also be applicable to various programming support
systems which deal with arrays, e.g. compilers of explicit parallel programs.
The remainder of the chapter is divided into the following sections. In
section 2, we present brief background information. Section 3 discusses guarded
array regions, followed by our summary algorithm in section 4. We address
the implementation issues in section 5. The preliminary results using GARs
for array privatization are shown in section 6. Section 7 briefly discusses the
related works. Finally, we conclude this chapter in section 8.

2. Preliminary

In this section, we provide background information regarding array reference
summary in the context of data dependence analysis and of data flow analysis.

2.1 Traditional Flow-Insensitive Summaries

Data dependence analysis is widely used in program analysis and is a key
technique for program optimization and parallelization. It is also a main
motivation for interprocedural analysis. Traditionally, data dependence analysis
concerns whether two statements modify and use a common memory location.
Given two statements, $S_1$ and $S_2$, which use and modify array A, we
assume $S_1$ is executed before $S_2$. Let $MOD_S$ be the set of all elements of
A that are modified by S and $USE_S$ the set of all elements of A that are
used by S. The dependences between $S_1$ and $S_2$ can be determined by the
following set operations [30]:

flow dependence: iff $MOD_{S_1} \cap USE_{S_2} \ne \emptyset$.
anti-dependence: iff $MOD_{S_2} \cap USE_{S_1} \ne \emptyset$.
output dependence: iff $MOD_{S_1} \cap MOD_{S_2} \ne \emptyset$.

When $S_1$, $S_2$, or both, are call statements, the above sets need to be
obtained by summarizing the called routines. Because summarizing MOD and USE
sets does not involve analysis of how data flows within routines, e.g.
reaching-definition analysis, live variable analysis, array kill analysis,
these summaries are referred to as flow-insensitive summaries.
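
The three tests above translate directly into set intersections. The Python sketch below is only an illustration: it uses explicit sets of integer subscripts for a one-dimensional array, whereas a compiler would work on symbolic summaries, and the name `dependences` and the sample statements are assumptions.

```python
def dependences(mod1, use1, mod2, use2):
    """Classify dependences from S1 to S2 (S1 executes first) for one array."""
    return {
        "flow": bool(mod1 & use2),     # S1 writes what S2 reads
        "anti": bool(use1 & mod2),     # S1 reads what S2 overwrites
        "output": bool(mod1 & mod2),   # both write the same element
    }


# S1: A(1:5) = ...          S2: ... = A(4:8)
s1_mod, s1_use = set(range(1, 6)), set()
s2_mod, s2_use = set(), set(range(4, 9))
result = dependences(s1_mod, s1_use, s2_mod, s2_use)
```

Here only the flow dependence is reported, because elements 4 and 5 are written by S1 and read by S2.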

The MOD and USE sets above are given for a statement. They can also
be defined for an arbitrary code segment which has a single entry and a single
exit. When these sets cannot be summarized exactly, an over-approximation,
meaning that summary sets are super sets of the real sets, is computed so
that dependence tests can still be conducted conservatively. Such an
over-approximation set is also called a may set. To understand these summary
schemes, we will briefly discuss the following two aspects: set representations
and summaries of multiple array references over multiple control branches.

A. Set representations

Several set representations have been proposed so far, including convex
regions [27,28], data access descriptors [2] and bounded regular sections (for
simplicity, we refer to these as regular sections when there is no confusion)
[5, 15]. Although data access descriptors are more restrictive than convex
regions, they basically define sets by a system of inequalities and equalities
just as convex regions do. Data access descriptors also contain additional
information besides the shapes of array accesses. Regular sections are simple
and more restricted in representing power compared with convex regions and
data access descriptors, but they still cover most cases in practice and can
be implemented efficiently.

To show the difference between convex regions and regular sections,
consider the following examples. The set of all array elements in a two
dimensional array A(N, M), where N and M are dimension sizes, can be represented
exactly either by a convex region

$$\{\phi_1 = I,\ \phi_2 = J,\ 1 \le I \le N,\ 1 \le J \le M\}$$

or by a regular section

    A(1 : N : 1, 1 : M : 1).

The upper triangular half of that array can also be represented exactly by a
convex region:

$$\{\phi_1 = I,\ \phi_2 = J,\ 1 \le I \le N,\ 1 \le J \le M,\ I \le J\}.$$

However, this array region cannot be represented exactly by a regular section.
It can only be approximated by A(1 : N : 1, 1 : M : 1).
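
The loss of precision can be made visible by enumeration. In the Python sketch below, which is an illustration with hypothetical function names and a fixed size N = M = 4, the regular section is expanded into its elements and compared with the upper triangular region.

```python
def regular_section(ranges):
    """Enumerate a two dimensional regular section given as two (l, u, s) triples."""
    (l1, u1, s1), (l2, u2, s2) = ranges
    return {(i, j) for i in range(l1, u1 + 1, s1) for j in range(l2, u2 + 1, s2)}


def upper_triangle(n, m):
    """Elements (I, J) with 1 <= I <= n, 1 <= J <= m and I <= J."""
    return {(i, j) for i in range(1, n + 1) for j in range(1, m + 1) if i <= j}


N = M = 4
section = regular_section([(1, N, 1), (1, M, 1)])
triangle = upper_triangle(N, M)
extra = section - triangle     # elements added by the approximation
```

The regular section is a superset of the triangle, so using it as a summary is safe but turns an exact set into a may set.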

B. Summarizing references

To facilitate analysis, a code segment that will be summarized has a single
entry point and a single exit point. In concept, this code segment can be
viewed as a set of paths. Each path in this set is composed of a sequence
of statements, from the entry point of the code segment to the exit point of
the code segment, in the order of statement execution. The summary sets for
each path are obtained by composition operations, which are union operations
to combine all array references in the path together. The summary for the
whole code segment is collected by merging the summary sets of all paths.
Such merging operations, called meet operations, are union operations too.
Also, all variables in the final summary sets should have their values on entry
to the code segment. These summary schemes do not distinguish summary
sets for different paths. Instead, they provide summary sets that are super
sets of those of any path. We call these path-insensitive summaries.

In practice, a path-insensitive summary scheme can take the union of all array
references to form summary sets without considering paths, since both meet and
composition operations are union operations. Variable replacements are
carried out through the summarizing procedure in order to represent summary
sets in terms of the input of the code segment. The typical code segments are
loops, loop bodies, and routines.

Obviously, path-insensitive summary schemes produce may summary sets.
The may sets can also be introduced when subscripts of array references
are complicated (e.g. with expressions not representable in the compiler). In
addition, attempting to represent the summary in a single convex region or
a single regular section can also cause may sets because union operations
are not closed. Some previous approaches such as [18] use a list of references
as a set representation to avoid union operations and thus maintain exact
summary information.

2.2 Array Data Flow Summaries

Recently, array data flow summaries, which analyze the intervening array
kills, have been investigated [7,11,14,17,25,29]. In these array data flow
summaries, sets of upward exposed array uses (UE sets), which cannot be
calculated without considering the effects of array kills, are calculated in
addition to MOD and USE sets. The UE sets corresponding to a code segment
contain all array elements whose values are imported to the code segment.
The computation of UE sets usually needs difference operations. Let us look
at an example. Suppose that statement S is in a given code segment and
$USE_S$ denotes all the array elements used by S. Let $MOD_{<S}$ be the set
containing all array elements written before S within the code segment. These
array summary sets are assumed to correspond to the same array A. To
compute the UE sets for the code segment contributed by $USE_S$, we perform
the following difference:

$$USE_S - MOD_{<S}$$

When $MOD_{<S}$ is a may set, the difference should be conservatively
approximated by $USE_S$. Thus, $MOD_{<S}$ does not affect the computation of UE
at all. Since this causes flow-sensitive analysis to degenerate, a must set
(an under-approximation set in contrast to an over-approximation may set),
which is definitely accessed, is introduced. Whenever the MOD set is inexact,
we can use a must set in a difference operation. Therefore, the difference
$USE_S - MOD_{<S}$ can be safely performed if $MOD_{<S}$ is a must set. Such a
difference result is more precise even though it is still a may set.
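
The rule for the difference can be summarized in a few lines of Python. This is a sketch on explicit element sets; the function name and the flag `mod_is_must` are assumptions introduced for illustration.

```python
def upward_exposed(use, mod_before, mod_is_must):
    """Conservative UE contribution of a statement S.

    use         -- elements used by S
    mod_before  -- elements written before S in the code segment
    mod_is_must -- True if every element of mod_before is surely written
    """
    if mod_is_must:
        return use - mod_before    # definite writes kill the uses
    return set(use)                # a may set cannot safely kill anything


use = set(range(1, 11))            # S reads A(1:10)
must_mod = set(range(1, 6))        # A(1:5) is written on every path
may_mod = set(range(1, 11))        # A(1:10) is written on some paths only

ue_with_must = upward_exposed(use, must_mod, mod_is_must=True)
ue_with_may = upward_exposed(use, may_mod, mod_is_must=False)
```

With the must set, only A(6:10) remains upward exposed; with the may set, the whole of A(1:10) has to be kept.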

The meet operations for must sets are intersection operations, while
composition operations are still union operations. Must sets are calculated by
intersecting summary sets for different branches at branching statements.
The meet operations for must sets are closed. However, difference operations,
which are introduced by computing UE sets, are not closed.

Previous experiments [3,21] as well as our experience indicate that the
path-insensitive MUST/MAY array data flow summary is not enough: IF
conditions need to be considered in order to handle certain important cases
in real programs. Take Figure 2.1 as an example. The MUST/MAY summary
for the body of the outer DO loop will produce

    may MOD : A(jlow : jup : 1), A(jmax : jmax : 1)
    may UE  : A(jmax : jmax : 1)

The intersection of these two sets is nonempty and hence the compiler has to
conservatively assume that there exist loop-carried flow dependences. However,
the IF condition (.NOT.p) is loop-invariant, which guarantees that no
loop-carried flow dependence exists in the outer loop, and therefore array A
is privatizable. (We will discuss array privatization in section 6.)

In the next section, we propose a path-sensitive summary scheme which
uses guarded array regions [12] to handle IF conditions. In order to preserve
the exactness of union and difference results, we devise a scheme based on
lists of guarded array regions. Such a scheme turns out to enhance the
efficiency of our approach as well.

      DO I = 1, 4
        DO J = jlow, jup
          A(J) = ...
        END DO
        IF (.NOT.p) THEN
          A(jmax) = ...
        END IF
        DO J = jlow, jup
          ... = A(J) + A(jmax)
        END DO
      END DO

Fig. 2.1. Examples of Privatizable Arrays
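
The claim about Figure 2.1 can be checked on concrete values. The Python sketch below is an illustration with hypothetical names and arbitrary bounds: it computes the exact MOD and UE sets of each outer iteration for a fixed, loop-invariant value of p and looks for a loop-carried flow dependence, then contrasts this with the verdict obtained from the path-insensitive may sets.

```python
def iteration_sets(p, jlow, jup, jmax):
    """Exact MOD and UE sets of array A for one outer iteration of Fig. 2.1."""
    mod = set(range(jlow, jup + 1))            # first inner loop writes A(jlow:jup)
    if not p:
        mod.add(jmax)                          # guarded write to A(jmax)
    use = set(range(jlow, jup + 1)) | {jmax}   # second inner loop reads
    return mod, use - mod                      # all writes precede the reads


def has_loop_carried_flow(p, jlow, jup, jmax, trips=4):
    """True if some iteration imports a value written by an earlier iteration."""
    written_earlier = set()
    for _ in range(trips):
        mod, ue = iteration_sets(p, jlow, jup, jmax)
        if ue & written_earlier:
            return True
        written_earlier |= mod
    return False


def path_insensitive_conflict(jlow, jup, jmax):
    """Verdict based on may MOD and may UE, ignoring the IF condition."""
    may_mod = set(range(jlow, jup + 1)) | {jmax}
    may_ue = {jmax}
    return bool(may_mod & may_ue)
```

Whatever the value of p, no loop-carried flow dependence is found: if p is false, A(jmax) is written before it is read in the same iteration, and if p is true it is never written inside the loop. The path-insensitive test nevertheless reports a conflict.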



3. Guarded Array Regions 

Our analysis is based on two basic sets which describe array references, the
upward exposed use set (UE set) and the modification set (MOD set). The
UE set is the set of the upward exposed array elements which take values
defined outside a given program segment. The MOD set is the set of array
elements written within a given program segment.

Our basic unit of array reference representation is a regular array region,
which is also called a bounded regular section [15]. It is a reduced form
of the original regular sections proposed by Callahan and Kennedy [5]. On
the other hand, we extend original regular sections in the following ways to
meet our need to represent UE and MOD sets. First, since references to an
array often cannot be easily represented by a single regular section without
losing exactness, we use a list of regular array regions instead. In addition,
we augment regular array regions by adding predicates as guards in order to
better deal with IF conditions. The following gives the definitions of regular
array regions and guarded array regions.

Definition. A regular array region of array A is denoted by
$A(r_1, r_2, \ldots, r_m)$, where m is the dimension of A, $r_i$, $i = 1, \ldots, m$,
is a range in the form of $(l : u : s)$, and l, u, s are symbolic expressions.
The triple $(l : u : s)$ represents all values from l to u with step s.
$(l : u : s)$ is simply denoted by $(l)$ if l = u and by $(l : u)$ if s = 1. An
empty array region is represented by $\emptyset$ and an unknown array region is
represented by $\Omega$.

Definition. A guarded array region (GAR) is a tuple $[P, R]$ which contains
a regular array region R and a guard P, where P is a predicate that specifies
the condition under which R is accessed. We use $\Delta$ to denote a guard whose
predicate cannot be written explicitly, i.e. an unknown guard. If both
$P = \Delta$ and $R = \Omega$, we say the GAR $[P, R] = \Omega$ is unknown. Similarly,
if either P is False or R is $\emptyset$, we say $[P, R]$ is $\emptyset$.

The regular array region defined above is more restrictive than the original
regular section used in the ParaScope environment at Rice University [2,5,15].
This basic unit, however, will be able to cover the most frequent cases in real
programs and it seems to have an advantage in efficiency when dealing with
the common cases. The guards in GARs can be used to describe more complex
array sections, although their primary use is to describe IF conditions
under which regular array regions are accessed.

An unknown value is used carefully in the compiler in order to preserve
as much precision as possible. First, one unknown dimension in a
multi-dimensional array does not make the whole region unknown. Also, if the
upper bound in a range tuple $(l : u : s)$ is unknown, we mark it as
$(l : unknown : s)$ instead of as an unknown tuple.

We use a list of GARs to represent a MOD set. However, using a list
of GARs for a UE set could introduce either more predicate operations [12]
or an unnecessary sacrifice of the exactness of difference operation results. Our
solution to this problem is to define a data structure called a GAR with a
difference list (GARWD). A UE set is represented by a list of GARWDs.

Definition. A GAR with a difference list (GARWD) is a set defined by
two components: a source GAR and a difference list. The source GAR is an
ordinary GAR defined above, while the difference list is a list of GARs. The
set denotes all the members of the source GAR which are not in any GAR
of the difference list. It is written as {source GAR, <difference list>}.



      DO I = 1, M                  DO I = 1, M
        A(1:N:1) = ...               MOD_i: A(1:N:1), A(N2:N2:1)
        A(N2) = ...                  UE_i:  { A(2:N1:1), < A(1:N:1), A(N2:N2:1) > }
        ... = A(2:N1:1)            ENDDO
      ENDDO

Fig. 3.1. Example of GARWDs



Figure 3.1 is an example showing the use of GARWDs. The right-hand
side is the MOD/UE summary for the body of the outer loop shown on
the left-hand side. The subscript i in both sets means the summary is for
an arbitrary iteration i. $UE_i$ is represented by a GARWD. For simplicity,
we omit the guards here, whose values are true in the example. To prove
that no loop-carried data flow dependence exists, we need to show that the
intersection of $UE_i$ and the set of all mods before iteration i is empty. The
set of all mods within iterations prior to iteration i, denoted as $MOD_{<i}$,
is equal to $MOD_i$. (In theory, $MOD_{<i} = \emptyset$ if i = 1. But this does not
invalidate the analysis above. Similarly, $MOD_{>i}$ denotes the set of mods
within the iterations after iteration i.) Since both GARs in the $MOD_{<i}$ list
are in the difference list of $UE_i$, which is represented by a GARWD, it is
obvious that the intersection of $MOD_{<i}$ and $UE_i$ is empty. As will be
discussed later in Section 6, this further shows that array A is privatizable.
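
The cancellation argument for Figure 3.1 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: regions are opaque strings compared textually, guards are omitted (they are all true in Figure 3.1), and the function name `provably_disjoint` is hypothetical.

```python
def provably_disjoint(garwd, mod_list):
    """Return True when the GARWD and the MOD list cannot overlap.

    garwd is a pair (source, difference_list). The GARWD denotes the source
    minus every region in its difference list, so any MOD region that also
    appears in the difference list contributes nothing to the intersection.
    A False result means "not proven by cancellation", not "overlapping".
    """
    _source, difference_list = garwd
    return all(m in difference_list for m in mod_list)


# Summary sets of Figure 3.1; N, N1 and N2 stay symbolic.
ue_i = ("A(2:N1:1)", ["A(1:N:1)", "A(N2:N2:1)"])
mod_before_i = ["A(1:N:1)", "A(N2:N2:1)"]
no_carried_flow_dep = provably_disjoint(ue_i, mod_before_i)
```

No relation between N, N1 and N2 has to be known for this check, which is the point of keeping the difference unevaluated.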



3.1 Operations on GARs

Our data flow analysis requires three kinds of operations on GARs: union,
intersection, and difference. These operations in turn are based on union,
intersection, and difference operations on regular array regions as well as
logical operations on predicates. We will first describe the operations on array
regions, then on GARs, and finally on GARWDs.

Regular array region operations 

As operands of the region operations must belong to the same array, we
will drop the array name from the array region notation hereafter whenever
there is no confusion. Given two regular array regions,
$R_1 = A(r_1^1, r_2^1, \ldots, r_m^1)$ and $R_2 = A(r_1^2, r_2^2, \ldots, r_m^2)$,
where m is the dimension of array A, we define the following operations:

- $R_1 \cap R_2$

  For the sake of simplicity of presentation, here we assume steps of 1 in
  both $R_1$ and $R_2$. Other step values will be handled in Section 5. Let
  $r_i^1 = (l_i^1 : u_i^1 : 1)$, $r_i^2 = (l_i^2 : u_i^2 : 1)$, where
  $i = 1, \ldots, m$, and let $D_i$ be $r_i^1 \cap r_i^2$. We have
  $D_i = (\max(l_i^1, l_i^2) : \min(u_i^1, u_i^2) : 1)$ and have

  $$R_1 \cap R_2 = \begin{cases} \emptyset & \text{if } D_i = \emptyset \text{ for some } i \\ (D_1, D_2, \ldots, D_m) & \text{otherwise} \end{cases}$$

  Note that we do not keep max and min operators in a regular array region.
  Therefore, when the relationship of symbolic expressions cannot be
  determined even after a demand-driven symbolic analysis is conducted
  (cf. Section 5), we will mark the intersection as unknown.

- $R_1 \cup R_2$

  Since regular array regions may contain symbolic terms, care must be taken
  to prevent invalid regions from being created by union operations. For
  example, for $R_1 = (m : p : 1)$ and $R_2 = (p + 1 : n : 1)$, we have
  $R_1 \cup R_2 = (m : n : 1)$ if and only if both $R_1$ and $R_2$ are valid.
  The validity of this union result can be guaranteed by inserting validity
  predicates into guards [12]. However, since this introduces additional
  predicate operations which we try to avoid, we represent the union as a list
  of regions without merging them until they are known to be valid, e.g. when
  they are both constant regions.

- $R_1 - R_2$

  For an m-dimensional array, the result of the difference operation is
  generally $2^m$ regular regions if each range difference results in two new
  ranges. Such a result could be quite complex for large m. Nonetheless, it is
  useful to present a general formula for the result. Suppose
  $R_1 \supseteq R_2$ (otherwise, use $R_1 - R_2 = R_1 - (R_1 \cap R_2)$). We
  first define $R_1(k)$ and $R_2(k)$, $k = 1, \ldots, m$, as the last k ranges
  within $R_1$ and $R_2$ respectively. According to this definition, we have
  $R_1(m) = (r_1^1, r_2^1, \ldots, r_m^1)$ and $R_2(m) = (r_1^2, r_2^2, \ldots, r_m^2)$,
  and $R_1(m-1) = (r_2^1, r_3^1, \ldots, r_m^1)$ and
  $R_2(m-1) = (r_2^2, r_3^2, \ldots, r_m^2)$. The computation of $R_1 - R_2$ is
  recursively given by the following formula:

  $$R_1(m) - R_2(m) = \begin{cases} (r_1^1 - r_1^2) & \text{if } m = 1 \\ (r_1^1 - r_1^2,\ r_2^1, r_3^1, \ldots, r_m^1) \cup (r_1^2,\ (R_1(m-1) - R_2(m-1))) & \text{if } m > 1 \end{cases}$$

  To avoid a potentially complex result and also to keep the summary
  information precise, we use GARWDs to represent difference operations
  whenever the resultant difference is not a single regular region. We record
  the difference operation in the GARWDs without actually performing it until
  it is actually needed. As will be discussed in Section 5, the GARWDs also
  improve the efficiency of our overall scheme.

The following gives examples of the actual results of difference operations:

    (1 : 100) - (2 : 100) = (1)

    (1 : 100, 1 : 100) - (3 : 99, 2 : 100)
      = ((1 : 100) - (3 : 99), (1 : 100)) ∪ ((3 : 99), ((1 : 100) - (2 : 100)))
      = (((1 : 2) ∪ (100)), (1 : 100)) ∪ (3 : 99, 1)
      = (1 : 2, 1 : 100) ∪ (100, 1 : 100) ∪ (3 : 99, 1)
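
The recursive difference formula can be tried out on constant regions with the following Python sketch. It is an illustration only and rests on assumptions that go beyond the text: bounds are plain integers, every step is 1, a range is a pair `(lo, hi)`, and the second operand lies within the first. The function names are hypothetical.

```python
def range_diff(r1, r2):
    """r1 - r2 for unit-step integer ranges (lo, hi), with r2 inside r1."""
    (l1, u1), (l2, u2) = r1, r2
    parts = []
    if l1 <= l2 - 1:
        parts.append((l1, l2 - 1))   # what is left below r2
    if u2 + 1 <= u1:
        parts.append((u2 + 1, u1))   # what is left above r2
    return parts


def region_diff(R1, R2):
    """R1 - R2 for regions given as tuples of ranges, with R2 inside R1.

    First term: the difference in the first dimension combined with the
    remaining ranges of R1. Second term: the first range of R2 combined
    with the recursive difference of the remaining dimensions.
    """
    head = [(d,) + R1[1:] for d in range_diff(R1[0], R2[0])]
    if len(R1) == 1:
        return head
    tail = [(R2[0],) + rest for rest in region_diff(R1[1:], R2[1:])]
    return head + tail


example = region_diff(((1, 100), (1, 100)), ((3, 99), (2, 100)))
```

Under these assumptions, `example` contains the three regions (1 : 2, 1 : 100), (100, 1 : 100) and (3 : 99, 1) of the worked example, with single elements written as `(k, k)`.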

GAR operations:

Given two GARs, $T_1 = [P_1, R_1]$ and $T_2 = [P_2, R_2]$, we have the following:

- $T_1 \cap T_2 = [P_1 \wedge P_2, R_1 \cap R_2]$

- $T_1 \cup T_2$

  The following are the two most frequent cases of union operations:

  - If $P_1 = P_2$, the union becomes $[P_1, R_1 \cup R_2]$
  - If $R_1 = R_2$, the result is $[P_1 \vee P_2, R_1]$

  If two array regions cannot be safely combined due to unknown symbolic
  terms, we will keep the two GARs in a list without merging them.

- $T_1 - T_2 = [P_1 \wedge P_2, R_1 - R_2] \cup [P_1 \wedge \neg P_2, R_1]$

  As described earlier, the actual result of $R_1 - R_2$ may be multiple regular
  array regions, which makes the actual result of $T_1 - T_2$ potentially
  complex. However, as Figure 3.1 illustrates, difference operations can often
  be canceled by intersection and union operations. Therefore, we do not solve
  the difference $T_1 - T_2$ unless the result is a single GAR or until the last
  moment when the actual result must be solved in order to finish a data
  dependence test or an array privatizability test. When the difference is not
  solved with the above formula, it is represented by a GARWD.

The following shows examples of actual results of GAR operations:

    [T, (1 : 100)] ∩ [p, (2 : 101)] = [p, (2 : 100)]

    [T, (1 : 50)] ∪ [T, (51 : 100)] = [T, (1 : 100)]

    [T, (1 : 100)] ∪ [p, (1 : 100)] = [T, (1 : 100)]

    [T, (1 : 100)] - [T, (2 : 99)]
      = [T, ((1 : 100) - (2 : 99))] = [T, ((1) ∪ (100))]
      = [T, (1)] ∪ [T, (100)]

    [p, (2 : 99)] - [T, (1 : 100)] = [p, ((2 : 99) - (1 : 100))] = [p, ∅] = ∅
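
The GAR formulas can be checked on small concrete cases with the Python sketch below. It is not the compiler's symbolic representation: as an assumption made only for this illustration, a guard is modelled as the set of truth values of a single condition p under which it holds, and a region is an explicit set of integers. All names are hypothetical.

```python
WORLDS = frozenset({False, True})   # possible values of one IF condition p
T = WORLDS                          # guard "True": holds whatever p is
P = frozenset({True})               # guard "p"


def rng(lo, hi):
    """Unit-step region (lo : hi) as an explicit set of elements."""
    return frozenset(range(lo, hi + 1))


def simplify(gars):
    """Drop empty GARs: a False guard or an empty region."""
    return [(g, r) for g, r in gars if g and r]


def gar_intersect(t1, t2):
    """[P1 and P2, R1 intersect R2]."""
    (p1, r1), (p2, r2) = t1, t2
    return simplify([(p1 & p2, r1 & r2)])


def gar_difference(t1, t2):
    """[P1 and P2, R1 - R2] union [P1 and not P2, R1]."""
    (p1, r1), (p2, r2) = t1, t2
    return simplify([(p1 & p2, r1 - r2), (p1 & (WORLDS - p2), r1)])
```

For instance, `gar_difference((P, rng(2, 99)), (T, rng(1, 100)))` yields an empty list, matching the last example above.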

GARWD operations:

Operations between two GARWDs and between a GARWD and a GAR
can be easily derived from the operations on GARs. Some examples are given
below to illustrate these operations:

    {[T, (1 : 100)], <[T, (n : m)]>} - [T, (2 : 100)]
      = {([T, (1 : 100)] - [T, (2 : 100)]), <[T, (n : m)]>}
      = {[T, (1)], <[T, (n : m)]>}

    {[T, (1 : 100)], <[T, (n : m)]>} ∩ [p, (101 : 200)]
      = {([T, (1 : 100)] ∩ [p, (101 : 200)]), <[T, (n : m)]>}
      = {(∅), <[T, (n : m)]>} = ∅

    {[T, (1 : 100)], <[T, (n : m)]>} ∪ {[T, (1 : 100)], <>} = {[T, (1 : 100)], <>}

3.2 Predicate Operations 

Predicate operations are expensive in general and most compilers avoid
analyzing them. However, the majority of predicate-handling required for
collecting our array data flow summary involves simple operations such as
checking to see if two predicates are identical, if they are loop-independent,
or if they contain indices that affect the shapes or the sizes of array regions.
These relatively simple operations can be implemented in an efficient way.

We define a new canonical form to represent predicates in order to simplify
the pattern matching needed to check whether two predicates are identical.
Both conjunctive normal form (CNF) and disjunctive normal form (DNF)
have been widely used in program analysis [6,26]. These cited works show
that negation operations are expensive with both CNF and DNF. Our
previous experiments using CNF [12] also confirm their observations. Negation
operations occur not only due to ELSE branches, but also due to GAR and
GARWD operations elsewhere. Hence, we design a new canonical form such
that negation operations can be reduced significantly.

We use a two-level hierarchical approach to predicate handling. At the
high level, a predicate is represented by a predicate tree, $PT(V, E, r)$, where
V is the set of nodes, E is the set of edges, and r is the root of PT. The internal
nodes of V are NAND operators except the root, which is an AND operator.
The leaf nodes are divided into regular leaf nodes and negative leaf nodes.
A regular leaf node represents a predicate such as an IF condition or a DO
condition in the program, while a negative leaf node represents the negation
of a predicate. Theoretically, the predicate tree may not be unique for an
arbitrary predicate, which can cause pattern-matching to be conservative.
We believe, however, that such cases are rare and happen mostly when the
program cannot be parallelized.

[Figure: a predicate tree whose nodes are marked as operator, regular leaf, or negative leaf]

Fig. 3.2. High Level Representation of Predicates

Figure 3.2 shows a predicate tree. At this high level, we keep a basic
predicate as a unit and do not split it. The predicate operations are based
only on these units without further checking the contents within these basic
predicates. Figure 3.3 shows predicate operations. Figure 3.3(a) and (b) show
OR and AND operations respectively. Negation of a predicate tree, shown
in Figure 3.3(c), may either increase or decrease the tree height by one level
according to the shape of the current predicate tree. In Figure 3.3(c), if there
is only one regular leaf node (or one negative leaf node) in the tree, the regular
leaf node is simply changed to a negative leaf node (or vice versa). We use
a unique token for each basic predicate such that simple and common cases
can be easily handled without checking the contents of the predicates. At the
lower level of the hierarchy, the content of each predicate is represented in
CNF and is examined when necessary.
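
The following Python sketch shows one way negation could behave on such a tree. It is an assumption-laden illustration, not a reconstruction of Figure 3.3: the tuple encoding, the function names, and the exact negation rules (flip a lone leaf, unwrap a lone NAND child, otherwise wrap all children in a NAND) are chosen here only because they match the height behaviour described above.

```python
from itertools import product


def leaf(token, negated=False):
    return ("LEAF", token, negated)      # token identifies a basic predicate


def nand(*kids):
    return ("NAND", list(kids))


def root(*kids):
    return ("AND", list(kids))           # the root is the only AND operator


def evaluate(node, env):
    if node[0] == "LEAF":
        return env[node[1]] != node[2]   # a negative leaf inverts the value
    vals = [evaluate(k, env) for k in node[1]]
    return all(vals) if node[0] == "AND" else not all(vals)


def negate(tree):
    kids = tree[1]
    if len(kids) == 1:
        only = kids[0]
        if only[0] == "LEAF":            # lone leaf: flip its polarity
            return root(leaf(only[1], not only[2]))
        return root(*only[1])            # NOT NAND(x...) = AND(x...): one level less
    return root(nand(*kids))             # NOT AND(x...) = NAND(x...): one level more


def height(node):
    return 0 if node[0] == "LEAF" else 1 + max(height(k) for k in node[1])


def truth_table(tree, tokens):
    return [evaluate(tree, dict(zip(tokens, vals)))
            for vals in product([False, True], repeat=len(tokens))]
```

Because leaves carry only tokens, two trees can be compared structurally without looking inside the basic predicates.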

[Figure: panels (a) OR, (b) AND, and (c) negation of predicate trees]

Fig. 3.3. Predicate Operations

4. Constructing Summary GARs Interprocedurally

4.1 Hierarchical Supergraph 

In this section, we present algorithms to calculate the MOD and UE
information by propagating the GARs and the GARWDs over a hierarchical
supergraph (HSG). The HSG in this paper is an enhancement of Myers'
supergraph [22], which is a composite of the flow subgraphs of all routines in
a program. In a supergraph, each call statement is represented by a node,
termed a call node in this paper, which has an outgoing edge pointing to the
entry node of the flow subgraph of the called routine. The call node also has
an incoming edge from the unique exit node of the called routine. To
facilitate the information summary for DO loops, we add a new kind of node,
the loop node, to represent DO loops. The resulting graph, which we call the
hierarchical supergraph (HSG), contains three kinds of nodes: basic blocks,
loop nodes and call nodes. An IF condition itself forms a single basic block
node. A loop node is a compound node which has its internal flow subgraphs
describing the control flow within the DO loop. Due to the nested structures
of DO loops and routines, a hierarchy is derived among the HSG nodes, with
the flow subgraph at the highest level representing the main program. The
HSG resembles the HSCG used by the PIPS project [16]. Figure 4.1 shows an
HSG example. Note that the flow subgraph of a routine is never duplicated
for different calls to the same routine unless the called routine is duplicated
to enhance its potential parallelism. We assume that the program contains no
recursive calls. For simplicity of presentation, we further assume that a DO
loop does not contain GOTO statements which make premature exits. We
also assume that the HSG contains no cycles due to backward GOTO
statements. Our implementation, however, does take care of premature exits in DO
loops and backward GOTO statements, making conservative estimates when
necessary. In the flow subgraph of a loop node, the back edge from the exit
node to the entry node is deliberately deleted, as it conveys no additional
information for array summaries. Under the above assumptions and
treatment, the HSG is a hierarchical dag (directed acyclic graph). The following
subsection presents the information summary algorithms.

      IF (X > 10) THEN
        B = 10
        CALL sub1(A,B)
      ELSE
        B = 20
        CALL sub1(A,B)
      ENDIF
      C = ...

      SUBROUTINE sub1(A,M)
      DO I = 1, M
        A(I) = ...
      ENDDO
      END

Fig. 4.1. Example of the HSG



4.2 Summary Algorithms 

Suppose we want to check whether loop L is parallel. Let $g(s, e)$ be the HSG of
the subgraph of loop L (the graph for the loop body), where s is the starting
node and e the exit node. We use $UE(n)$ and $MOD(n)$ to denote the upward
exposed use set and the mod set for node n respectively. Furthermore, we
use $MOD\_IN(n)$ to represent the array elements modified in nodes which
are reachable from n, and we use $UE\_IN(n)$ to represent the array elements
whose values are imported to n and used in the nodes reachable from n.
Each $MOD(n)$ or $MOD\_IN(n)$ set is represented by a list of GARs and each
$UE(n)$ or $UE\_IN(n)$ set by a list of GARWDs. If any GAR contains variables
in its guard or its range triples, these variables should assume the values
imported at the entry of n. The compiler performs the following recursive
algorithm, gar_summary:

1. Summarize all basic block nodes within $g(s, e)$.

   For each basic block node n, we calculate the $MOD(n)$ and $UE(n)$ sets
   for arrays within node n. All predicates in GARs of $MOD(n)$ and $UE(n)$
   are set to True.

2. Summarize all compound nodes.

   a) Call nodes. For a call node n, let the subgraph of the call node
      be $g'(s', e')$. The compiler recursively applies gar_summary to
      $g'(s', e')$, which summarizes the references to global arrays and array
      parameters. The returning results are mapped back to the actual
      arguments of the procedure call and are denoted by $MOD(n)$ and $UE(n)$.

   b) Loop nodes. For a loop node n, let the subgraph of the loop node
      be $g'(s', e')$. The compiler recursively applies gar_summary to
      $g'(s', e')$. Let $UE_i(n)$ and $MOD_i(n)$ represent the UE and MOD sets
      of $g'(s', e')$, which contain modification GARs and upward exposed use
      GARWDs in one loop iteration indexed by i. The sets $MOD_i(n)$ and
      $(UE_i(n) - MOD_{<i}(n))$ are then expanded across the i index range
      to form $MOD(n)$ and $UE(n)$ for loop node n.

3. Propagate the array data flow information.

   From node e to s, the gar_summary algorithm traverses the nodes in
   $g(s, e)$ in a reverse topological order. During the summary propagation,
   the following flow equations are applied:

   $$MOD\_IN(n) = MOD(n) \cup \bigcup_{p \in succ(n)} MOD\_IN(p)$$

   $$UE\_IN(n) = UE(n) \cup \Big( \Big( \bigcup_{p \in succ(n)} UE\_IN(p) \Big) - MOD(n) \Big)$$

   Note that the set of successors of the exit node, $succ(e)$, is $\emptyset$.
   When applying the flow equations, we need to handle the following situations.

   - If n is a basic block containing an IF-condition, insert the
     corresponding predicate or its negation into the guard of each GAR which
     has been propagated backward into $MOD\_IN(n)$ and $UE\_IN(n)$ through
     the THEN edge or the ELSE edge respectively.

   - If any expression in $MOD\_IN(p)$ and $UE\_IN(p)$ contains a variable
     that is defined within n, then that variable must be substituted
     by the right-hand side of the defining statement within n. Therefore,
     all values of variables in $MOD\_IN(n)$ and $UE\_IN(n)$ are relative to
     the entry to n.

For loop L, the above algorithm produces the $MOD_i$ and $UE_i$ sets of $g(s, e)$,
where $MOD_i$ equals $MOD\_IN(s)$ and $UE_i$ equals $UE\_IN(s)$. These sets
can be used to do dependence tests and array privatization tests, which are
presented later in this chapter.

Fig. 4.2. The HSG of the Body of the Outer Loop for Figure 2.1

Let us use the example in Figure 2.1 to illustrate the algorithm. A
simplified HSG of the body of the outer loop is shown in Figure 4.2, which omits the
internal flow graphs of the compound nodes. Following the algorithm above,
Step 1 and Step 2 give us the following results:

    MOD(1) = [T, (jlow : jup)],   UE(1) = ∅
    MOD(2) = ∅,                   UE(2) = ∅
    MOD(3) = [T, (jmax)],         UE(3) = ∅
    MOD(4) = ∅,                   UE(4) = [T, (jlow : jup)] ∪ [T, (jmax)]

Step 3, following a reverse topological order, propagates the information in
the following four sub-steps:

1. MOD_IN(4) = MOD(4) = ∅
   UE_IN(4) = UE(4) = [T, (jlow : jup)] ∪ [T, (jmax)]

2. MOD_IN(3) = MOD(3) ∪ MOD_IN(4) = [T, (jmax)]
   UE_IN(3) = UE(3) ∪ (UE_IN(4) - MOD(3)) = {[T, (jlow : jup)], <[T, (jmax)]>}

3. MOD_IN(2) = [¬p, (jmax)]
   UE_IN(2) = {[¬p, (jlow : jup)], <[¬p, (jmax)]>} ∪ [p, (jlow : jup)] ∪ [p, (jmax)]

   In the above, ¬p is inserted into the guards of the GARs which are
   propagated through the TRUE edge of the IF condition (.NOT.p), and p is
   inserted into those propagated through the FALSE edge.

4. MOD_IN(1) = [T, (jlow : jup)] ∪ [¬p, (jmax)]
   UE_IN(1) = UE_IN(2) - MOD(1) = {[p, (jmax)], <[T, (jlow : jup)]>}

Therefore, the summary sets of the body of the outer loop (DO I) should be:

    MOD_i = MOD_IN(1) = [T, (jlow : jup)] ∪ [¬p, (jmax)]
    UE_i  = UE_IN(1)  = {[p, (jmax)], <[T, (jlow : jup)]>}
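
The four sub-steps can be replayed on concrete element sets with the Python sketch below. It is a simulation under assumptions not found in the text: the bounds jlow = 1, jup = 10, jmax = 20 are arbitrary sample values, guards are handled by running the propagation once for each value of p instead of symbolically, and the node numbering (1: first J loop, 2: IF, 3: assignment to A(jmax), 4: second J loop) follows the sets listed above.

```python
def summarize_body(p, jlow=1, jup=10, jmax=20):
    """Return (MOD_IN(1), UE_IN(1)) for the loop body of Figure 2.1."""
    J = set(range(jlow, jup + 1))
    mod = {1: J, 2: set(), 3: {jmax}, 4: set()}
    ue = {1: set(), 2: set(), 3: set(), 4: J | {jmax}}
    # IF (.NOT.p): node 2 leads to node 3 when p is False, else directly to node 4.
    succ = {1: [2], 2: [4] if p else [3], 3: [4], 4: []}
    mod_in, ue_in = {}, {}
    for n in (4, 3, 2, 1):                       # reverse topological order
        m_succ = set().union(*(mod_in[s] for s in succ[n]))
        u_succ = set().union(*(ue_in[s] for s in succ[n]))
        mod_in[n] = mod[n] | m_succ              # MOD_IN equation
        ue_in[n] = ue[n] | (u_succ - mod[n])     # UE_IN equation
    return mod_in[1], ue_in[1]


mod_true, ue_true = summarize_body(p=True)
mod_false, ue_false = summarize_body(p=False)
# A path-insensitive "may" summary merges both outcomes of p.
may_overlap = (mod_true | mod_false) & (ue_true | ue_false)
```

For either value of p the MOD and UE sets are disjoint, while the merged may sets overlap in A(jmax). This is the loss of precision that the guards avoid.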

4.3 Expansions 

Given a loop with index i, where $i \in (l : u : s)$, and $MOD_i$ and $UE_i$ as
the summary sets for the loop body, what are the UE and MOD sets for
the loop? This problem can be solved by an expansion procedure which is
conducted in the following two steps:

Step I:
$$UE(i) = UE_i - MOD_{<i}, \quad \text{where } MOD_{<i} = \bigcup_{j \in (l : u : s),\ j < i} MOD_j$$

Step II:
$$MOD = \bigcup_{i \in (l : u : s)} MOD_i, \qquad UE = \bigcup_{i \in (l : u : s)} UE(i)$$

Step I in the above computes the set of array elements used in iteration
i which are exposed to the outside of the whole loop. Step II projects the
result of Step I to eliminate i. The projection for a list of GARs is conducted
by the projection for each GAR separately. Suppose Q is a GAR. If Q does
not contain i in its representation, then the projection of Q is Q itself. If Q
contains i in its representation, then the projection of Q by i, denoted by
proj(Q), is a GAR obtained by the following steps:

- If i appears in the guard of Q, then i should be solved from the guard
  which, in general, is written as $i \in (l' : u' : s)$, where $l'$ and $u'$ may be
  symbolic expressions. We obtain a new domain of i as follows,

  $$\Big( \Big\lceil \frac{\max(l, l') - l}{s} \Big\rceil s + l \ :\ \Big\lfloor \frac{\min(u, u') - l}{s} \Big\rfloor s + l \ :\ s \Big)$$

  which simplifies to $(\max(l, l') : \min(u, u'))$ for s = 1. The inequalities and
  equalities involving i in the guard can then be deleted.

  For example, given $i \in (2 : 100 : 2)$ and GAR $[5 \le i, A(i)]$, the new domain
  of i should be (6 : 100 : 2). $5 \le i$ is removed from the GAR.

- If i appears in only one dimension of Q, and if the result of substituting
  $l \le i \le u$, or the new bounds on i obtained above, into the old range triple
  in that dimension can still be represented by a range triple $(l' : u' : s')$,
  then we replace the old range triple by $(l' : u' : s')$.

  If, in the above, the result of substitution of $l \le i \le u$ into the old range
  can no longer be represented by a range, then we mark this range as $\Omega$
  (unknown). (Better approximation is possible for special cases, but we have
  not pursued this further.)

- If i appears in more than one dimension of Q, then these dimensions are
  marked as unknown.

As an example, suppose a DO loop indexed by i has loop bounds $1 \le i \le 100$.
Further, suppose the given GAR is $Q = [3 \le i + 1 \le 51, (1 : i : 1)]$. As
the result of solving i from the guard, we have new bounds on i which are
$(\max(1, 3 - 1) : \min(100, 51 - 1)) = (2 : 50)$. The expansion of Q by i is
$[True, (1 : 50 : 1)]$.
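
The new-domain formula and the expansion example can be checked numerically with the Python sketch below. It assumes integer bounds and a positive step, which the text does not require, and it performs the projection of $(1 : i : 1)$ by brute-force union instead of symbolic substitution. The function names are hypothetical.

```python
def new_domain(l, u, s, lo, hi):
    """Domain of i in (l : u : s) restricted by the guard lo <= i <= hi, s > 0."""
    # ceil((max(l, lo) - l) / s) * s + l, using floor division of the negated value
    first = -((l - max(l, lo)) // s) * s + l
    # floor((min(u, hi) - l) / s) * s + l
    last = ((min(u, hi) - l) // s) * s + l
    return first, last, s


def project_prefix_regions(l, u, s, lo, hi):
    """Union, over the new domain of i, of the 1-D regions (1 : i : 1)."""
    first, last, step = new_domain(l, u, s, lo, hi)
    covered = set()
    for i in range(first, last + 1, step):
        covered |= set(range(1, i + 1))
    return covered


even_domain = new_domain(2, 100, 2, 5, 100)              # guard 5 <= i
solved_bounds = new_domain(1, 100, 1, 2, 50)             # guard 3 <= i + 1 <= 51
expanded = project_prefix_regions(1, 100, 1, 2, 50)
```

A guard without an upper bound is passed here as `hi = u`, so that the minimum in the formula has no effect.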

Since the $UE_i$ set is represented by a list of GARWDs, the expansion of
$UE_i$ can be obtained by taking the union of the expansions of the individual
GARWDs in $UE_i$. Therefore, for each GARWD W in $UE_i$, we need to compute
$(W - MOD_{<i})$ and then its projection. If i cannot be removed from the
difference list of a GARWD by simplifying the GARWD, we over-approximate this
GARWD by removing the whole difference list from the GARWD. As we
mentioned previously, a GARWD is used to represent UE sets only. Therefore,
over-approximation is always safe. Afterwards, each GARWD in the $UE_i$ set
either has an empty difference list or one that is loop-invariant. Now only the
source GAR of a GARWD needs to be considered below.

The difference $(W - MOD_{<i})$ is calculated by successive subtractions of
each GAR of $MOD_{<i}$ from W. Let $Q_i$ be the source GAR of W, $i \in (l : u : s)$.
The details of the expansion of a single GARWD W, including Step I and
Step II, are as follows:

(A) If $Q_i$ is loop-invariant, then we have the expansion of W equal to
    $proj(W - MOD_{<i}) = W$. Stop.

(B) For (each M in $MOD_{<i}$) {

    1. If M is loop-invariant and $Q_l$ (the value of Q for the first iteration)
       does not overlap with M, we insert M into the difference list of W. Goto
       continue.

    2. If M is loop-invariant and $Q_l$ is equal to M, then M does not affect the
       expansion. Goto continue.

    3. If $Q_i \cap M = \emptyset$, then $Q_i - M = Q_i$. Goto continue.

    4. Suppose both $Q_i$ and M are loop-variant and the index i appears in the
       same dimension, say dimension j, in both $Q_i$ and M. Further suppose
       that $Q_i$ and M are identical in all dimensions except dimension j, in
       which the range tuples contain one single element and can be represented
       by $(E_q)$ and $(E_m)$ respectively. If $(E_m - E_q)$ is not a constant, goto
       step 5. If $(E_m - E_q)$ is a constant, the coefficients of i in both $Q_i$
       and M must be the same. Let this coefficient be coef. (If $(E_m - E_q)$ is
       not divisible by coef, then $Q_i \cap M = \emptyset$ and should be covered in
       step 3.) When both coef and s are greater than zero, the $Q_i$ in the first
       $(E_m - E_q)/coef$ number of iterations is exposed to the outside of the
       loop. Therefore,

       $$(Q_i - M_{<i}),\ i \in (l : u : s) = \begin{cases} Q_i,\ i \in (l : l + (E_m - E_q)/coef - s : s) & \text{if } (E_m - E_q) > 0 \\ Q_i,\ i \in (l : u : s) & \text{if } (E_m - E_q) \le 0 \end{cases}$$

       For the first case, a new domain for i is established. l and u will have
       new values for the rest of the computation on $(W - MOD_{<i})$.

       For other cases in which coef or s is not greater than zero, the formula
       can be acquired similarly.

       Goto continue.

    5. We mark W as a may set and drop the difference list of W. Compute
       proj(W), which is the final result of the expansion of the original W.
       Stop.

    6. continue:

    }

(C) Compute proj(W), which is the final result of the expansion of the
    original W.

The above steps have been implemented currently and can be extended
and refined. The projection of a GARWD W, denoted above by proj(W), can
be acquired by projections on each GAR in W separately.
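
The case analysis of step 4 can be cross-checked by brute force with the Python sketch below. The assumptions are specific to this illustration: the use is the single element coef*i + c_q, the write is the single element coef*i + c_m, all quantities are integers, coef and s are positive, and the offset d = (c_m - c_q)/coef is a multiple of s. The upper bound is also clamped to u, which the formula in the text leaves implicit. The function names are hypothetical.

```python
def exposed_iterations(l, u, s, d):
    """Iterations of (l : u : s) whose use is not written by an earlier iteration.

    d is the constant (E_m - E_q) / coef, assumed to be a multiple of s.
    """
    if d > 0:
        # Only the first iterations, up to l + d - s, read values from outside.
        return list(range(l, min(u, l + d - s) + 1, s))
    return list(range(l, u + 1, s))      # the writer is never an earlier iteration


def brute_force(l, u, s, coef, c_q, c_m):
    """Same set, found by enumerating the elements written before each iteration."""
    iters = list(range(l, u + 1, s))
    exposed = []
    for i in iters:
        written_before = {coef * j + c_m for j in iters if j < i}
        if coef * i + c_q not in written_before:
            exposed.append(i)
    return exposed
```

With use A(2i) and write A(2i + 6) over i = 1, ..., 20, for example, both functions report iterations 1 to 3 as exposed.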



5. Implementation Considerations 

5.1 Symbolic Analysis 

Symbolic analysis handles expressions which involve unknown symbolic terms.
It is widely used in symbolic evaluation and abstract interpretation techniques
to discover program properties such as values of expressions, relationships
between symbolic expressions, etc. Symbolic analysis requires the ability to
represent and manipulate unknown symbolic terms. A symbolic analyzer can
adopt a certain normalized form to represent expressions [6,13,26]. The
advantage of such a normalized form is that it gives the same representation
for two congruent expressions, i.e. expressions that always have the same
value. For example, $x(x + 1)$ and $x + x^2$ are congruent expressions. In
addition, symbolic expressions encountered in array data flow analysis and
dependence analysis are mostly integer polynomials. Many operations on
integer polynomials, such as the comparison of two polynomials, can be
straightforward if a normalized form is adopted. Therefore, we adopt
normalized integer polynomials as our representation for symbolic expressions,
whose form is shown below:

$$e = \sum_{i=1}^{N} t_i I_i + t_0 \qquad (5.1)$$

where each $I_i$ is an index variable and $t_i$ is a term which is given by equation
(5.2) below:

$$t_i = \sum_{j=1}^{M_i} p_j, \quad i = 0, \ldots, N \qquad (5.2)$$

$$p_j = c_j \prod_{k=1}^{L_j} x_k^j, \quad j = 1, \ldots, M_i \qquad (5.3)$$

where $p_j$ is a product, $c_j$ is an integer constant (possibly an integer fraction),
$x_k^j$ is an integer variable which is not an index variable, N is the nesting
depth of the loop containing e, $M_i$ is the number of products in $t_i$, and $L_j$ is
the number of variables in $p_j$. All $t_i$ and $p_j$ are sorted by each variable's
order number, which is a unique integer number assigned to each variable.
Since both $M_i$ and $L_j$ contribute to the complexity of a polynomial, they are
chosen as design parameters which can be adjusted in the compiler, according
to experimental results, in order to control the complexity of expressions. As
an example, by limiting $M_i$ to be 1 and $L_j$ to be zero, the expression e will
become an affine expression. By controlling the complexity of expressions,
the complexity of the overall summary scheme is reduced.
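
Why a normalized form makes comparison straightforward can be seen in a small Python sketch. It is a simplification assumed for illustration only: a polynomial is a dictionary from a sorted tuple of variable names (a product) to its integer coefficient, index variables are not treated separately, and the limits $M_i$ and $L_j$ are not enforced.

```python
from collections import defaultdict


def poly(term):
    """Polynomial for an integer constant or a variable name."""
    if isinstance(term, int):
        return {(): term} if term else {}
    return {(term,): 1}


def add(a, b):
    out = defaultdict(int)
    for p in (a, b):
        for mono, c in p.items():
            out[mono] += c
    return {m: c for m, c in out.items() if c != 0}


def mul(a, b):
    out = defaultdict(int)
    for ma, ca in a.items():
        for mb, cb in b.items():
            # Sorting the variables gives each product a unique key.
            out[tuple(sorted(ma + mb))] += ca * cb
    return {m: c for m, c in out.items() if c != 0}


x = poly("x")
lhs = mul(x, add(x, poly(1)))      # x(x + 1)
rhs = add(x, mul(x, x))            # x + x^2
congruent = lhs == rhs
```

Both expressions reduce to the same dictionary, so congruence is decided by a plain equality test.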

We implement utility library functions to perform symbolic operations
such as additions, subtractions, multiplications, and divisions by an integer
constant. In addition, a simple demand-driven symbolic evaluation scheme
is implemented. It propagates an expression backwards over a control flow
graph until either the value of the expression becomes known or a predefined
propagation limit is reached.



5.2 Region Numbering 

During the propagation and manipulation of regular array regions, GARs,
and GARWDs, each of these data structures may be duplicated several times.
Operations involving these identical copies can potentially be time-consuming
if straightforward pattern-matching is conducted. Instead, we introduce a
region numbering scheme. In this scheme, the same number is assigned to
the identical copies of an array region. Figure 5.1 shows an example. Initially,
there are two GARs, numbered 1 and 2. As GAR No. 1 is propagated through
both branches, it is modified and becomes GAR No. 3 in the left branch
and No. 4 in the right branch, while GAR No. 2 remains unchanged. At the
entrance to the IF, GAR No. 3 is modified by adding predicate p and becomes
GAR No. 5, while GAR No. 4 becomes GAR No. 6. Since GAR No. 2 appears
in both branches (the number 2 appears in both lists), the IF condition is
not attached to it. Thus GAR No. 2 is unchanged.

    5 = [p, A(1:100)],  6 = [¬p, A(1:99)],  2

    3 = [T, A(1:100)],  2          4 = [T, A(1:99)],  2

        m = 100                        m = 99

    1 = [T, A(1:m)],  2 = [T, A(1:n)]

Fig. 5.1. Propagation Using Region Numbers
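
The treatment of the IF in Figure 5.1 can be sketched in Python as follows. The table layout, the guard encoding (a tuple of literal strings, with the empty tuple standing for True) and the names `RegionTable` and `merge_at_if` are assumptions made for this illustration, not the compiler's data structures.

```python
class RegionTable:
    """Assigns a fresh number to each GAR created or modified during propagation."""

    def __init__(self):
        self._next = 1
        self.by_number = {}

    def new(self, gar):
        n = self._next
        self._next += 1
        self.by_number[n] = gar
        return n


def merge_at_if(table, then_list, else_list, p):
    """Numbers found in both branch lists pass through unchanged; every other
    GAR gets the branch condition added to its guard and a new number."""
    common = set(then_list) & set(else_list)
    out = []
    for nums, literal in ((then_list, p), (else_list, "not " + p)):
        for n in nums:
            if n not in common:
                guard, region = table.by_number[n]
                out.append(table.new((guard + (literal,), region)))
    return out + sorted(common)


table = RegionTable()
for gar in [((), "A(1:m)"), ((), "A(1:n)"), ((), "A(1:100)"), ((), "A(1:99)")]:
    table.new(gar)                       # GARs No. 1 to No. 4 of Figure 5.1
at_if_entry = merge_at_if(table, [3, 2], [4, 2], "p")
```

Comparing the integer lists replaces pattern matching on the regions themselves. GAR No. 2 is recognized as common to both branches by its number alone.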

However, we do not try to use the same number to represent identical
array regions which originate from different array references in the program,
because that may require us to compare many array regions from time to
time. Instead, we use a simplifier to remove redundant array regions. The
simplifier is invoked at the end of summarizing a routine or immediately
before the data dependence tests and array privatization tests.

Predicates are handled similarly. We assign each leaf node and NAND
node a unique number to facilitate the operations on predicates.

5.3 Range Operations 

In this subsection, we give details of range operations for step values of other 
than one. These operations invoke the mm(ei, 62) and the max(ei, 62) func- 
tions. Each function either returns a symbolic expression containing neither 
min nor max function calls, or it returns an “unknown” value. Symbolic 
analysis is performed on demand. 

Given two ranges $r_1 = (l_1 : u_1 : s_1)$ and $r_2 = (l_2 : u_2 : s_2)$, the
following gives the results of range operations:

1. If $s_1 = s_2 = 1$,

   - Intersection: $r_1 \cap r_2 = (\max(l_1, l_2) : \min(u_1, u_2) : s_1)$.

   - Difference: assuming $r_2 \subseteq r_1$ (otherwise use $r_1 - r_2 = r_1 - (r_1 \cap r_2)$), we have
     $r_1 - r_2 = (l_1 : \max(l_1, l_2) - 1 : s_1) \cup (\min(u_1, u_2) + 1 : u_1 : s_1)$.

   - Union: if $(l_2 > u_1 + s_1)$ or $(l_1 > u_2 + s_2)$, $r_1 \cup r_2$ cannot
     be combined into one range. Otherwise, $r_1 \cup r_2 = (\min(l_1, l_2) : \max(u_1, u_2) : s_1)$,
     assuming $r_1$ and $r_2$ are both valid.

   The result is returned to the upper level routine, which decides whether to
   perform these operations, keep them in a list, etc.

2. If $s_1 = s_2 = c > 1$, where c is a known constant value, we do the following:
   if $(l_1 - l_2)$ is divisible by c, then we use the formulas in case 1 to compute
   intersection, difference and union. Otherwise, $r_1 \cap r_2 = \emptyset$ and $r_1 - r_2 = r_1$.
   The union $r_1 \cup r_2$ usually cannot be combined into one range and must
   be maintained as a list of ranges.

3. If $s_1 = s_2$ and $l_1 = l_2$ (which may be symbolic expressions),
   then we use the formulas in case 1 to perform intersection, difference and
   union.

4. If $s_1$ is divisible by $s_2$, we check if $r_2$ covers $r_1$. If so, we have $r_1 \cap r_2 = r_1$
   and $r_1 \cup r_2 = r_2$. Otherwise, we divide $r_2$ into several smaller ranges with
   step $s_1$ and then apply the above formulas.

5. In all other cases, the results of intersection and difference are marked as
   unknown and the union remains a list of two ranges.
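The first two cases can be illustrated with a minimal sketch, assuming concrete integer bounds instead of the symbolic expressions the analysis actually manipulates. The `Range` class and the function names are hypothetical and chosen only for illustration; cases 4 and 5 are reported as unknown (`None`) rather than implemented, and upper bounds are normalized to the last element actually reached so that the case 1 formulas carry over to a common step c.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Range:
    """A range (lo : hi : step) with concrete integer bounds (an assumption)."""
    lo: int
    hi: int
    step: int = 1

    def normalized(self) -> "Range":
        # Pull hi down to the last element actually reached from lo.
        if self.hi < self.lo:
            return self
        last = self.lo + ((self.hi - self.lo) // self.step) * self.step
        return Range(self.lo, last, self.step)

    def elements(self) -> set:
        return set(range(self.lo, self.hi + 1, self.step))


def intersect(r1: Range, r2: Range) -> Optional[List[Range]]:
    """Cases 1 and 2. Returns [] for the empty set and None for 'unknown'."""
    if r1.step != r2.step:
        return None                      # cases 4 and 5 are not sketched
    c = r1.step
    if (r1.lo - r2.lo) % c != 0:
        return []                        # same step, interleaved elements
    lo, hi = max(r1.lo, r2.lo), min(r1.hi, r2.hi)
    return [Range(lo, hi, c).normalized()] if lo <= hi else []


def difference(r1: Range, r2: Range) -> Optional[List[Range]]:
    """r1 - r2, computed as r1 - (r1 intersect r2)."""
    common = intersect(r1, r2)
    if common is None:
        return None
    if not common:
        return [r1]
    c, m = r1.step, common[0]
    pieces = [Range(r1.lo, m.lo - c, c), Range(m.hi + c, r1.hi, c)]
    return [p.normalized() for p in pieces if p.lo <= p.hi]


def union(r1: Range, r2: Range) -> List[Range]:
    """One merged range when possible, otherwise a list of two ranges."""
    r1, r2 = r1.normalized(), r2.normalized()
    c = r1.step
    mergeable = (
        r1.step == r2.step
        and (r1.lo - r2.lo) % c == 0
        and not (r2.lo > r1.hi + c or r1.lo > r2.hi + c)
    )
    if not mergeable:
        return [r1, r2]
    return [Range(min(r1.lo, r2.lo), max(r1.hi, r2.hi), c)]
```

A symbolic implementation would replace the integer comparisons with calls to the min and max functions described above and would return “unknown” whenever those comparisons cannot be resolved.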



6. Application to Array Privatization and Preliminary
Experimental Results



In this section, we first discuss how our array data flow summary can be used 
to perform array privatization even in difficult cases which are sensitive to IF 
conditions. We then present preliminary experimental results to show both
the capability and the efficiency of our analysis. 



6.1 Array Privatization 

Array privatization has been shown to be one of the most important tech- 
niques to improve the performance of parallelizing compilers [3,4,12,21]. This 
technique creates a distinct copy of an array for each processor such that stor- 
age conflicts can be eliminated without violating program semantics. Priva- 
tizable arrays appear in real programs quite often and usually serve as tem- 
porary working spaces. It has been shown that traditional data dependence 
tests cannot identify privatizable arrays. In contrast, data flow analysis, when 
supported by IF condition handling, symbolic analysis, and interprocedural 
analysis, allows a powerful compiler to recognize privatizable arrays. 

An array A is a privatization candidate in a loop L if its elements are
overwritten in different iterations of L (see [17]). Such a candidacy can be
established by examining the summary array set $MOD_i$: if the intersection
of $MOD_i$ and $MOD_{<i}$ is nonempty, then A is a candidate. A privatization
candidate is privatizable if there exist no loop-carried flow dependences in L.
For an array A in a loop L with an index I, if $MOD_{<i} \cap UE_i = \emptyset$, then there
exists no flow dependence carried by loop L.

Let us look at Figure 2.1 again. In Figure 4.2, we have computed $MOD_i$
and $UE_i$. Since $MOD_i$ is not loop-variant, we have $MOD_{<i} = MOD_i$. Hence,
$MOD_i \cap MOD_{<i}$ is nonempty and array A is a privatization candidate.
Furthermore,

$$
\begin{aligned}
UE_i \cap MOD_{<i}
 &= \{[\bar{p}, (jmax)], \langle [T, (jlow : jup)] \rangle\} \cap ([T, (jlow : jup)] \cup [p, (jmax)]) \\
 &= (\{[\bar{p}, (jmax)], \langle [T, (jlow : jup)] \rangle\} \cap [T, (jlow : jup)]) \cup
    (\{[\bar{p}, (jmax)], \langle [T, (jlow : jup)] \rangle\} \cap [p, (jmax)]) \\
 &= (\{[\bar{p}, (jmax)], \langle [T, (jlow : jup)] \rangle\} \cap [T, (jlow : jup)]) \cup
    \{[\bar{p}, (jmax)] \cap [p, (jmax)], \langle [T, (jlow : jup)] \rangle\} \\
 &= \{[\bar{p}, (jmax)], \langle [T, (jlow : jup)] \rangle\} \cap [T, (jlow : jup)] \\
 &= \emptyset
\end{aligned}
$$

The last difference operation above can be easily done because the GAR
[T, (jlow : jup)] is in the difference list. The fact that $UE_i \cap MOD_{<i}$ is
empty guarantees that array A is privatizable.



6.2 Preliminary Experimental Results 

We have implemented our array data flow analysis in a prototype parallelizing
compiler, Panorama, which is a multiple-pass, source-to-source Fortran
program restructurer [23]. It performs parsing, construction of a hierarchical
supergraph (HSG) and interprocedural scalar UD/DU chains [1], conventional
data dependence tests, array data flow analysis and other analyses which support
memory performance enhancement, and parallel code generation.

Table 6.1 shows several time-consuming Fortran loops in the Perfect
benchmark suite which are parallelized after our array data flow analysis
and array privatization. (Additional transformations such as induction variable
substitution are performed automatically to enable parallelization; their
discussion is omitted here.) The last column of Table 6.1
presents the current status of array privatization achieved by Panorama. All
arrays listed in Table 6.1 can be privatized under our scheme except array
RL in the MDG program. This case requires handling predicates involving subscript
variables, which our current implementation is unable to deal with [12].



Table 6.1. Experimental Results on Loops with Privatizable Arrays

Program   Routine/Loop   Array Names                Status*
-------   ------------   ------------------------   -------
TRACK     nlfilt/300     P1,P2,P,PP1,PP2,PP,XSD     yes
MDG       interf/1000    RS,FF,GG,XL,YL,ZL          yes
                         RL                         no
          poteng/2000    RS,RL,XL,YL,ZL             yes
TRFD      olda/100       XRSIQ,XIJ                  yes
          olda/300       XIJKS,XKL                  yes
OCEAN     ocean/270      CWORK                      yes
          ocean/480      CWORK,CWORK2               yes
          ocean/500      CWORK                      yes
ARC2D     filerx/15      WORK                       yes
          filery/39      WORK                       yes
          stepfx/300     WORK                       yes
          stepfy/420     WORK                       yes

*: Status shows whether these privatizable arrays can be automatically privatized now.



Table 6.2 shows the CPU time spent on the main parts of Panorama. In
Table 6.2, parsing time is the time to parse the program once, although
Panorama currently parses a program three times (the first time to construct
the call graph and rearrange the parsing order of the source
files, the second time for interprocedural analysis, and the last time for code
generation). We do not count the time spent on memory performance analyses
in Table 6.2.

The row “HSG & DOALL Checking” shows the time taken to build the
HSG and UD/DU chains and to do conventional DOALL checking. The “Array
Summary” row refers to our array data flow analysis, which is applied only
to loops whose parallelizability cannot be determined by the conventional
DOALL tests. Even though the time percentage of array data flow analysis is
high (about 40%), the total execution time is small (38 seconds maximum).
As an interesting comparison, the next row marked by “f77 -O” shows the
time spent by the f77 compiler with option -O to compile the corresponding
FORTRAN program into sequential machine code. This row is used as a
reference point to better understand the speed of our parallelizing compiler.
The data suggest that Panorama is quite efficient.

Table 6.2. Execution Time (in seconds) Distribution (Timing is acquired on
SGI Indy workstations with 134MHz MIPS R4600 CPU and 64 MB memory)

Program                 TRACK     MDG    TRFD   OCEAN   ARC2D
---------------------   -----   -----   -----   -----   -----
#Lines                   3784    1238     485    4343    3964
Parsing                  1.78    0.86    0.40    1.26    2.43
HSG & DOALL Checking     3.07    1.72    0.49    0.91    5.16
Array Summary            6.01    2.38    0.50    0.36   27.72
Code Generation          1.62    0.82    0.22    0.42    2.26
Total                   12.48    5.78    1.61    2.96   37.56
f77 -O                  17.54   12.37    7.17   39.32   37.67
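The percentages quoted above can be checked against Table 6.2 with a short calculation. The sketch below copies the "Array Summary", "Total" and "f77 -O" rows from the table; reading "about 40%" as the unweighted mean of the per-program shares is an assumption, since the text does not say how the figure was obtained.

```python
# Seconds per program, copied from Table 6.2.
array_summary = {"TRACK": 6.01, "MDG": 2.38, "TRFD": 0.50, "OCEAN": 0.36, "ARC2D": 27.72}
total = {"TRACK": 12.48, "MDG": 5.78, "TRFD": 1.61, "OCEAN": 2.96, "ARC2D": 37.56}
f77_opt = {"TRACK": 17.54, "MDG": 12.37, "TRFD": 7.17, "OCEAN": 39.32, "ARC2D": 37.67}

# Share of the total time spent in array data flow analysis, per program.
shares = {p: array_summary[p] / total[p] for p in total}

# Unweighted mean of the shares (assumed interpretation of "about 40%").
mean_share = sum(shares.values()) / len(shares)

# Panorama's total time relative to the f77 -O compile time.
ratio_to_f77 = {p: total[p] / f77_opt[p] for p in total}

for p in total:
    print(f"{p:6s} summary share {shares[p]:6.1%}   Panorama/f77 {ratio_to_f77[p]:5.2f}")
print(f"mean summary share {mean_share:.1%}")
```

Under that reading the shares range widely, from roughly 12% for OCEAN to roughly 74% for ARC2D, so the single percentage is best taken as a rough average.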



7. Related Works 

There exist a number of approaches to array data flow analysis. As far as we
know, no work has particularly addressed the efficiency issue or has presented
efficiency data. One class of approaches attempts to gather flow information
for each individual array reference. Feautrier [10] calculates the source function,
which represents the precise array definition points for each array use.
Maydan, et al. [20, 21] simplify Feautrier’s method by computing a
Last-Write-Tree (LWT). Duesterwald, et al., compute the dependence distance for
each reaching definition within a loop [8]. Pugh and Wonnacott [24] use a
set of constraints to describe array data flow problems and solve them by an
extended Fourier-Motzkin variable elimination. Maslov [19], as well as Pugh
and Wonnacott [24], extends the previous work in this category by handling
certain IF conditions. Generally, these approaches are intraprocedural and
do not seem easy to extend interprocedurally.

The other class of approaches summarizes the array elements accessed by
a set of array references instead of analyzing individual array references. This
class extends data structures designed by the early works for flow-insensitive
array problems, such as regular array sections [5, 15], convex regions [27,
28], and data access descriptors [2]. Array data flow analysis in this class
includes works done by Gross and Steenkiste [11], Rosene [25], Li [17], Tu
and Padua [29], Creusillet and Irigoin [7], and M. Hall, et al. [14]. The
work described in this chapter seems to be the only one which considers the
effect of IF conditions in a general way, although some previous works also
handle IF conditions under more limited circumstances. Although this class
of approaches does not provide as many details about reaching definitions as
the first one, it handles complex program constructs better and is easier to
perform interprocedurally.



8. Conclusion 

Array data flow information is important to successful automatic program
parallelization. Propagating array data flow summaries interprocedurally is an
effective way to deal with procedure calls in program parallelization. Most
existing forms of array data flow summaries are path-insensitive, i.e., they do
not distinguish summary sets for different program paths. In certain important
cases, such path-insensitivity badly hurts the result of parallelization.

In this chapter, we have proposed a path-sensitive array data flow sum- 
mary, and we have described an efficient scheme to compute such a summary 
and use it for array privatization and program parallelization. Efficient sym- 
bolic processing is integrated into the process of propagating array access 
information. Preliminary results of array privatization suggest that our anal- 
ysis is both powerful and efficient. 
Chapter 8. Automatic Array Privatization

Peng Tu and David Padua

Digital Computer Laboratory, Department of Computer Science, University of
Illinois at Urbana-Champaign, Urbana, IL 61801



Summary. This chapter discusses techniques for automatic array privatization 
developed as part of the Polaris project at the University of Illinois at Urbana- 
Champaign. Array privatization is one of the most important transformations for 
effective program parallelization. 

The array privatization problem is formulated as data flow equations involving 
sets of scalars and array regions. A single-pass algorithm to solve these data flow 
equations is introduced. Finally, this chapter presents a demand-driven symbolic 
analysis algorithm to manipulate array regions whose bounds are represented by 
symbolic expressions. 



1. Introduction 

Memory-related dependence may severely limit the effectiveness of a compiler
in several areas such as parallelization, load balancing, and communication
optimization. Fortunately, many memory-related dependences can be eliminated.
One strategy, known as privatization, allocates a copy of variables involved
in memory-related dependences to each processor participating in the
parallel execution of a program. A similar technique, called expansion [PW86],
transforms each reference to a particular scalar into a reference to a vector
element such that each thread accesses a different vector element. When applied
to an array, expansion creates a new dimension for the array. Previous
studies have shown that this process of replicating variables is one of the
most effective transformations for the exploitation of parallelism [EHLP91].

Because the access to a private variable is inherently local, privatiza- 
tion reduces the need for communication and facilitates data distribution. 
Furthermore, because there is a private instance of some variables for each 
active processor, privatization provides opportunities to spread computation 
across the processors and thus improve load balancing [TP93a]. 

In this chapter, we present a compiler algorithm for array privatization. 
The algorithm has been implemented in the Polaris parallelizing compiler 
[BEF+94]. Two important aspects of array privatization are discussed: 

- The formulation of the array privatization problem as data flow equations
  involving array regions and an algorithm to solve these equations. To simplify
  the presentation, it is assumed that the only iterative constructs in
  the source program are DO loops. However, the algorithm can trivially be
  extended to accept other classes of loops.

- A demand-driven symbolic analysis technique to manipulate array regions
  involving symbolic boundaries.

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 247-281, 2001. 

Springer-Verlag Berlin Heidelberg 2001 
The rest of the chapter is organized as follows. Section 2 introduces and
illustrates array privatization and last-value assignment. Section 3 presents 
the data flow analysis formulation of privatization. Section 4 discusses the 
demand-driven strategy for symbolic analysis and how it is used to support 
the solution of the data flow equations. Related work is discussed in Section 5. 



2. Background 

Data dependence [Ban88] specifies the precedence constraints that a compiler
must obey during program transformation. These dependences are due either
to producer/consumer relations (flow dependences) or to memory reuse or update
operations (anti-dependences and output dependences). The dependences
of this second class are known as memory-related dependences.

Consider the following loop: 

S1:  DO I = 1, N
S2:    A(1) = X(I,J)
S3:    DO J = 2, N
S4:      A(J) = A(J-1)+Y(J)
S5:    ENDDO
S6:    DO K = 1, N
S7:      B(I,K) = B(I,K) + A(K)
S8:    ENDDO
S9:  ENDDO



Because every iteration of loop S1 reuses array A, loop S1 cannot be executed
in parallel. However, there is no producer/consumer relationship between
loop iterations and, therefore, no flow dependences exist between the iterations
of S1. The memory-related dependences between iterations of loop S1
can be eliminated by declaring A to be private. In this way, each processor
participating in the execution of loop S1 will have its own copy of array A
and there will be no reuse of memory.

A source-to-source parallelizing compiler like Polaris could transform the 
loop in the previous example by adding the following directives: 

C$DIR PARALLEL DO 
C$DIR PRIVATE A(1:N) 

C$DIR LAST VALUE A(1:N) WHEN (I.EQ.N) 



These directives have the following interpretation: 

- The PARALLEL DO directive indicates that the iterations of loop S1 may be
  executed in parallel.

- The PRIVATE directive declares the privatizable arrays. That is, each processor
  cooperating in the execution of the loop allocates a local copy for
  each array in the PRIVATE directive before executing any statement in the
  loop. If the PRIVATE array is used before or after the loop, a global copy,
  accessible outside the loop, is also allocated.

- During the entire execution of an iteration, references to a private array
  are directed to the processor’s local copy.

- The LAST VALUE directive specifies the conditions under which a processor
  should copy all or part of a private array into the global version of the
  array. After the execution of an iteration has completed, the processor
  checks the last-value assignment condition in the LAST VALUE directive of
  the loop. If the condition is satisfied, the processor copies the private array
  to the corresponding global array. This operation is called copy-out.
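The effect of the PRIVATE and LAST VALUE directives can be simulated with a small sketch. It is not the code Polaris generates: iterations are run in a shuffled order to stand in for parallel execution, arrays are 1-based Python lists with an unused slot 0, and the statement `A(1) = X(I,J)` is simplified to `A[1] = X[i]`. All function names are hypothetical.

```python
import random


def loop_body(i, A, X, Y, B, n):
    """One iteration of loop S1, operating on whichever copy of A it is given."""
    A[1] = X[i]                        # S2 (simplified: X indexed by I only)
    for j in range(2, n + 1):          # S3-S5
        A[j] = A[j - 1] + Y[j]
    for k in range(1, n + 1):          # S6-S8
        B[i][k] = B[i][k] + A[k]


def run_sequential(X, Y, n):
    A = [0] * (n + 1)
    B = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        loop_body(i, A, X, Y, B, n)
    return A, B


def run_privatized(X, Y, n, seed=0):
    A_global = [0] * (n + 1)
    B = [[0] * (n + 1) for _ in range(n + 1)]
    order = list(range(1, n + 1))
    random.Random(seed).shuffle(order)     # arbitrary execution order
    for i in order:
        A_private = [0] * (n + 1)          # PRIVATE A(1:N)
        loop_body(i, A_private, X, Y, B, n)
        if i == n:                         # LAST VALUE A(1:N) WHEN (I.EQ.N)
            A_global[1:n + 1] = A_private[1:n + 1]   # copy-out
    return A_global, B
```

Because every use of A in an iteration is preceded by a definition in the same iteration, the privatized run yields the same A and B as the sequential run regardless of the iteration order.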

There are cases in which a private array needs to be initialized with values
from its global counterpart. This initialization operation is called copy-in.
The following is a simple definition of privatizable array when copy-in is not
allowed.

Definition 2.1. Let A be an array that is referenced in a loop L. We say A
is privatizable to L if every fetch to an element of A in L is preceded by a
store to the element in the same iteration of L. □

To include copy-in, we can simply relax the condition and define privati- 
zable arrays as those that are either defined within the same iteration before 
they are used or are not defined in any of the iterations preceding the use. 
For simplicity, we will present the algorithm without copy-in. 

Notice that, according to the definition, array A in the following loop is 
privatizable. 



S1:  DO I = 1, N
S2:    A(I) = ...
S3:    ... = A(I)
S4:  ENDDO



However, in this case, privatization would not help eliminate any
memory-related dependence. Privatization is profitable when memory-related
dependences are eliminated. More formally, we have the following definition.

Definition 2.2. We say that it is profitable to privatize an array A to a loop
L if different iterations of L may access the same location of A. □

To determine the copy-out condition, if every iteration of a loop writes to
the same array section, the last iteration of the loop needs to write its local
copy to the global array to preserve the semantics of the original sequential
loop. We call these cases static last-value assignments. Once a static
last-value assignment situation is identified, the code generation pass can either
generate conditional code for the last-value assignment inside the loop body,
or peel the last iteration of the loop out of the parallel region. If different
iterations write to different sections of an array, or if only some iterations
write to the same section of an array, the last-value assignment usually has
to be resolved at runtime.

Consider the following loop 



C$DIR PARALLEL DO
C$DIR PRIVATE A(1:N)
C$DIR LAST VALUE A(1) WHEN (I.EQ.N)
C$DIR LAST VALUE A(2:N) WHEN DYNAMIC
S1:  DO I = 1, N
S2:    A(1) = X(I,J)
S3:    IF (A(1) .GT. 0) THEN
S4:      DO J = 2, N
S5:        A(J) = A(J-1)+Y(J)
S6:      ENDDO
S7:    ENDIF
S8:  ENDDO



Here, array section A(2:N) is conditionally assigned. Although it can be 
determined that A is privatizable and that it is profitable to privatize it, the 
last-value assignment of A cannot be determined at compile time. We use 
the key word DYNAMIC to specify that runtime resolution techniques will have 
to be used for the array section A(2:N). We call a case like this dynamic 
last-value assignment. 

Runtime resolution, for example, could be based on synchronization variables
[ZY87]. For the last loop, this approach would create a last-iteration
variable containing the number of the last iteration that wrote A(2:N). Every
iteration that defines A(2:N) atomically compares its iteration number
with the value in the last-iteration variable. If its iteration number is larger
than the value in the variable, the processor stores its iteration number into
the last-iteration variable and copies out A(2:N). Otherwise, the assignment is
ignored because a later iteration has already written to A(2:N).
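A minimal sketch of this runtime resolution follows, assuming a lock stands in for the atomic compare-and-update and a thread pool stands in for the processors. The class `LastValueSection` and the other names are hypothetical; as before, arrays are 1-based lists with an unused slot 0 and `X(I,J)` is simplified to `X[i]`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LastValueSection:
    """Global section target[lo:hi] guarded by a last-iteration variable."""

    def __init__(self, target, lo, hi):
        self.target, self.lo, self.hi = target, lo, hi
        self.last_iteration = 0            # no iteration has written yet
        self.lock = threading.Lock()       # makes compare-and-copy atomic

    def copy_out(self, iteration, private):
        with self.lock:
            if iteration > self.last_iteration:
                self.last_iteration = iteration
                self.target[self.lo:self.hi + 1] = private[self.lo:self.hi + 1]
            # otherwise a later iteration has already written the section


def run_sequential(X, Y, n, A0):
    A = list(A0)
    for i in range(1, n + 1):
        A[1] = X[i]
        if A[1] > 0:
            for j in range(2, n + 1):
                A[j] = A[j - 1] + Y[j]
    return A


def run_parallel(X, Y, n, A0, workers=4):
    A_global = list(A0)
    tail = LastValueSection(A_global, 2, n)    # A(2:N) WHEN DYNAMIC

    def iteration(i):
        A = [0] * (n + 1)                      # private copy of A
        A[1] = X[i]
        if A[1] > 0:
            for j in range(2, n + 1):
                A[j] = A[j - 1] + Y[j]
            tail.copy_out(i, A)                # dynamic last value
        if i == n:
            A_global[1] = A[1]                 # static last value of A(1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(iteration, range(1, n + 1)))
    return A_global
```

Only iterations that actually define A(2:N) compete for the copy-out, so the global section ends up holding the values of the highest-numbered such iteration, as in the sequential loop.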



3. Algorithm for Array Privatization 

3.1 Data Flow Framework 

Data flow analysis examines the flow of values through a program and solves 
data flow problems by propagating information along the paths of the pro- 
gram’s control flow graph. Because private arrays are associated with DO loops 
in the program, we extend the traditional control flow graph with information
about the scope of DO loops. 

Definition 3.1. Let $G = (N, E, s)$ be a control flow graph where N are
nodes, E are arcs, and $s \in N$ is the initial node. Let L be a subflowgraph
corresponding to a DO loop (including all loops nested within it). We define
$CONTROL(L) \subset L$ as the subset of nodes in L corresponding to the
loop entry, increment, and test of the loop index. CONTROL(L) identifies
the DO loop index, its limits, and its step. We define BODY(L) as
$L - CONTROL(L)$. □

When a program has nested loops, the CONTROL and the BODY of 
all inner loops are included in the body of the outer loop. When the control 
flow of an inner loop is not important for the analysis of an outer loop, we 
can collapse the subflowgraph corresponding to the inner loop into one node. 

Definition 3.2. Let $G = (N, E, s)$ be a flow graph and L be a DO loop in G.
Then COLLAP(G, L) is G with the subflowgraph L collapsed into one node.
□



Given a subflowgraph L corresponding to a loop, we want to determine
if, for every iteration of the loop, all reaching definitions to an array use
come from the same iteration. We can do this through def-use analysis. The
data values to be analyzed are: scalar variables, such as A, B, C; subscripted
variables, such as X(1), X(2), X(I); and subarrays (or array sections), such
as X(1 : N : 1). Subarrays are defined as follows:

Definition 3.3. A subarray is a sparse convex polytope representing an array
region. Each axis of a subarray is a set of inequalities representing the lower
bound, upper bound, and stride for the corresponding array region. □

The notion of subarray we use in this chapter is an extension of the regular 
section used by others [CK88a]. Using subarrays, we can represent triangular 
sections and banded sections, as well as the strips, grids, columns, rows, and 
blocks of an array. For instance, the following examples respectively represent 
a dense upper triangle, grids in the upper triangle, and diagonal of array A. 



(A(I,I:N), [I=1:N]) 
(A(I,I:N:2), [I=1:N:2]) 
(A(I,I), [I=1:N]) 



The privatization algorithm is organized into four phases as shown in
Figure 3.1. Phases 1 and 2 are discussed next and phases 3 and 4 are explained
in Sections 3.4 and 3.5. The first phase computes outward exposed definitions
and uses for each basic block S in the loop body. A definition of variable v
in a basic block S is said to be outward exposed if it is the last definition of
v in S. A use of v is outward exposed if S does not contain a definition of v
before the use [ZC91].

Definition 3.4. Let S be a basic block and VAR be the set of scalar variables,
subscripted variables, and subarrays in the program.

1. $DEF(S) := \{v \in VAR : v \text{ has an outward exposed definition in } S\}$

2. $USE(S) := \{v \in VAR : v \text{ has an outward exposed use in } S\}$ □

Henceforth, the term variable will stand for scalar variables, subscripted
variables, and subarrays. We define $MRD_{in}(S)$ as the set of variables that
are always defined upon entering S, and $MRD_{out}(S)$ as the set of variables
that are always defined upon exiting S. Let pred(S) be the set of immediate
predecessors of S ignoring all the back edges in the loop’s flow graph. The
second phase of the privatization algorithm computes $MRD_{in}(S)$ using the
following equations:

$$MRD_{in}(S) = \bigcap_{t \in pred(S)} MRD_{out}(t)$$

$$MRD_{out}(S) = MRD_{in}(S) \cup DEF(S)$$

The initial solution used by the privatization algorithm for each $MRD_{in}$ is
the empty set. Back edges in the graph are ignored because MRD(S) is
concerned only with the values that are defined in the statements preceding
S within the DO loop body. Because back edges are deleted, the algorithm
actually works on a DAG. Back edges of inner DO loops carry information
needed in the analysis of outer loops, but they are handled by abstraction
and aggregation, as discussed in the next section.
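Because the graph is a DAG, the two equations can be solved in a single pass over the nodes in topological order. The sketch below assumes variables are treated as atomic symbols (plain strings), so it sidesteps the symbolic region operations a real implementation needs; the function name and the dictionary-based graph encoding are illustrative choices.

```python
def solve_mrd(order, preds, defs):
    """Single-pass solution of the MRD equations on a DAG.

    order: node names in topological order (back edges already removed)
    preds: node name -> list of immediate predecessors
    defs:  node name -> set of variables with an outward exposed definition
    """
    mrd_in, mrd_out = {}, {}
    for s in order:
        incoming = [mrd_out[t] for t in preds.get(s, [])]
        # Must-defined on entry: defined along every incoming path.
        mrd_in[s] = set.intersection(*incoming) if incoming else set()
        mrd_out[s] = mrd_in[s] | defs.get(s, set())
    return mrd_in, mrd_out


# Body of the conditional loop from Section 2: S2 defines A(1), S3 tests it,
# the collapsed inner loop S4 defines A(2:N) and J, and S7 is the join.
order = ["S2", "S3", "S4", "S7"]
preds = {"S3": ["S2"], "S4": ["S3"], "S7": ["S3", "S4"]}
defs = {"S2": {"A(1)"}, "S4": {"A(2:N)", "J"}}
mrd_in, mrd_out = solve_mrd(order, preds, defs)
```

At the join node S7 only A(1) is must-defined, because A(2:N) is written on just one of the two incoming paths; this is the situation that leads to a dynamic last-value assignment.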

Each set defined above, such as DEF, USE, and MRD, is a subset of
VAR. Hence, the domain of the data flow information set is the powerset
$\mathcal{P}(VAR)$. The union operator ($\cup$) is precise in the sense that it will not
summarize two sets unless the summary set has exactly the same members
as the two sets. For instance, $\{A(I)\} \cup \{A(1:N)\}$ will return $\{A(I), A(1:N)\}$
unless $I \in [1:N]$, but $\{A(1:N:2)\} \cup \{A(2:N-1:2)\}$ will return A(1:N).
The intersection operator ($\cap$) is conservative in the sense that it will return
an empty set $\emptyset$ if it cannot determine the precise intersection of its operands.
For instance, $\{A(I)\} \cap \{A(1:N)\}$ will return $\emptyset$ unless $I \in [1:N]$. Because the
intersection is conservative, there will be some potential loss of information
at each join node of the program flow graph. Hence, the effectiveness of the
algorithm will depend on the system’s ability to determine the relationship
between symbolic variables.



3.2 Inner Loop Abstraction 

Algorithm Privatize

privatize := func(L)
  Input:  flowgraph for loop L with back edges removed
  Output: DEF(L), USE(L), PRI(L)

  Phase 1: Collect local information
    foreach statement S ∈ BODY(L) in rPostorder do
      if S ∈ control(M) for some loop M nested in L then
        ! S is in an inner loop, visit M first
        [DEF(S), USE(S)] ← privatize(M)
        ! collapse all nodes in M onto S
        L ← COLLAP(L, M)
      else
        compute local DEF(S), USE(S)
      endif
    endfor

  Phase 2: Solve the MRD data flow equations for each statement
    foreach S ∈ BODY(L) initialize MRD(S) ← ∅
    foreach S ∈ BODY(L) in rPostorder do
      MRD_in(S)  ← ∩ { MRD_out(t) : t ∈ pred(S) }
      MRD_out(S) ← MRD_in(S) ∪ DEF(S)
    end

  Phase 3: Compute summary sets for the loop body
    DEF_b(L)    ← ∩ { MRD_out(t) : t ∈ exits(BODY(L)) }
    USE_b(L)    ← ∪ { USE(t) − MRD_in(t) : t ∈ BODY(L) }
    PRI_b(L)    ← ∪ { USE(t) ∪ DEF(t) : t ∈ BODY(L) } − USE_b(L)
    PRI_b^st(L) ← DEF_b(L) ∩ PRI_b(L)
    PRI_b^dy(L) ← PRI_b(L) − PRI_b^st(L)

  Phase 4: Return aggregated sets DEF(L) and USE(L)
    test if it is profitable to privatize PRI_b(L)
    determine last value assignment
    [PRI^st(L), PRI^dy(L)] ← aggregate(PRI_b^st, PRI_b^dy, control(L))
    [DEF(L), USE(L)] ← aggregate(DEF_b(L), USE_b(L), control(L))
    return [DEF(L), USE(L)]

Fig. 3.1. Algorithm for Identifying Privatizable Arrays

When the algorithm finds a loop nested inside a loop body, it will recursively
call itself on the inner loop. To hide the control flow structure of an inner
loop, we define the set of privatizable variables, and extend the previous
definitions from the basic block level to the loop level. We start by defining
the information for one iteration of the loop.

Definition 3.5. Let L be a loop and VAR be the variables in the program.
We define the following summary sets for BODY(L):

1. $DEF_b(L)$ is the set of must define variables for one iteration of L. These
   are the must define variables upon exiting the iteration:

   $$DEF_b(L) = \bigcap \{MRD_{out}(t) : t \in exits(L)\}$$

2. $USE_b(L)$ is the set of possibly outward exposed use variables. It is the
   set of variables used in some statements of L, but which are not in their
   $MRD_{in}$:

   $$USE_b(L) = \bigcup \{USE(t) - MRD_{in}(t) : t \in BODY(L)\}$$

3. $PRI_b(L)$ is the set of privatizable variables. These are the variables used
   and not exposed to definitions outside the iteration:

   $$PRI_b(L) = \bigcup \{USE(t) \cup DEF(t) : t \in BODY(L)\} - USE_b(L)$$

□

The summary sets represent the effect of a loop iteration on the data 
flow values. Using the summary sets, we can ignore the structure of the inner 
loops in the analysis of the outer loop. The trade-off is that we have to make 
a conservative approximation and may lose information in the process. 

The set of privatizable subarrays can be enlarged if we allow copy-in of 
their global counterparts’ values. This can be accomplished by privatizing 
subarrays with exposed uses as long as the exposed uses do not overlap with 
subarrays that may be defined in any iteration preceding the one containing 
the use. It is straightforward to compute the set of subarrays that may be 
defined in a loop body. 

In the analysis of the outer loop, we must consider the total effect of an 
inner loop on data flow values. That is, we need to account for the effect 
of back edges and the iteration space of the loop. The summary sets are 
parameterized by the loop index variables. Hence we can account for the 
effect of all the iterations in the loop by aggregating the summary sets across 
the iteration space of the loop. 

Definition 3.6. Let L be a loop. The aggregated sets DEF(L), USE(L),
and PRI(L) contain the subarrays of the corresponding summary sets after
$USE_b(L)$, $DEF_b(L)$, and $PRI_b(L)$, respectively, are aggregated across the
iteration space of the loop. □

The aggregation process, which is summarized in Phase 4 of the algorithm in
Figure 3.1, computes the section spanned by each array reference in $USE_b(L)$,
$DEF_b(L)$, and $PRI_b(L)$ across the iteration space. Aggregation can be accomplished
by a relatively straightforward processing of the loop index and
its boundaries. In our representation of variables, a subarray is represented
as a subscripted variable together with a subscript range. To aggregate a
subarray, we simply need to add the information about the loop index and
its boundaries to the subarray expression. For instance, if I is a loop index
or an induction variable with value [1:N:1], then A(I,J) will be aggregated
as (A(I,J), [I=1:N]) = A(1:N,J); and A(I,1:I) will be aggregated
as (A(I,1:I), [I=1:N]).
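For a one-dimensional subscript that is affine in the loop index, aggregation reduces to substituting the index bounds. The following sketch assumes concrete integer bounds and a positive coefficient; the function names are hypothetical, and the symbolic case handled by the compiler is not modeled.

```python
def aggregate_affine(a, b, lo, hi, step=1):
    """Section (first, last, stride) touched by A(a*I + b) for I = lo, lo+step, ..., hi."""
    if a <= 0 or step <= 0 or lo > hi:
        raise ValueError("sketch handles only a > 0, step > 0, lo <= hi")
    last_index = lo + ((hi - lo) // step) * step   # last value I actually takes
    return (a * lo + b, a * last_index + b, a * step)


def section_elements(section):
    """Concrete elements of a (first, last, stride) section."""
    first, last, stride = section
    return set(range(first, last + 1, stride))


# A(I) over I = 1..10 aggregates to A(1:10:1).
whole = aggregate_affine(1, 0, 1, 10)

# With N2 = 4: WORK(2*I-1) and WORK(2*I) over I = 1..N2 aggregate to the odd
# and even elements, whose union is the dense section WORK(1:2*N2).
N2 = 4
odd = aggregate_affine(2, -1, 1, N2)
even = aggregate_affine(2, 0, 1, N2)
dense = section_elements(odd) | section_elements(even)
```

The two stride-2 sections interleave exactly, which is the kind of case where the precise union operator of Section 3.1 can return a single dense section.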

Because one iteration’s use could sometimes be exposed only to the definitions
of some previous iterations of the same loop, a naive aggregation of
$USE_b(L)$ may exaggerate the exposed use set. The reason is that the uses
covered by the definitions in previous iterations are not exposed to the outside
of the loop and, therefore, they should be excluded from the aggregated
USE(L) set. For instance, in the following loop,

L1:  DO K = 1, N
       A(1) = 1
L:     DO I = 2, N
S1:      A(I) = A(I-1)
       ENDDO
L2:    DO J = 1, N
         C(K,J) = A(J)
       ENDDO
     ENDDO





the information for one iteration is $USE_b(L) = \{A(I-1), B(J), J\}$ and
$DEF_b(L) = \{A(I)\}$; the section defined in all iterations prior to the Ith
iteration is A(2:I-1); and only in the first iteration is A(I-1) exposed to
definitions outside the loop. Hence, $USE(L) = \{A(1)\}$. By eliminating the
exposed uses to the definitions in previous iterations, we identify array section
A(1 : N) to be privatizable in the outer loop L1. The less precise analysis
would assume that loop L has exposed use of A(2 : N) and, therefore, would
not privatize A(1 : N) in the outer loop.
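The difference between the naive and the refined aggregation can be seen by enumerating the iterations of loop L for a concrete N. This is only an illustration: the compiler reasons symbolically about A(2:I-1), whereas the sketch below tracks explicit element sets for array A alone, and its function names are hypothetical.

```python
def naive_use(iterations, uses):
    """Union of the per-iteration exposed uses, ignoring earlier definitions."""
    exposed = set()
    for i in iterations:
        exposed |= uses(i)
    return exposed


def exposed_use(iterations, uses, defs):
    """Uses that are not covered by definitions made in previous iterations."""
    defined_before, exposed = set(), set()
    for i in iterations:
        exposed |= uses(i) - defined_before
        defined_before |= defs(i)
    return exposed


# Loop L:  DO I = 2, N:  A(I) = A(I-1), with N = 6.
N = 6
iterations = range(2, N + 1)
uses = lambda i: {i - 1}     # element of A read in iteration i
defs = lambda i: {i}         # element of A written in iteration i

naive = naive_use(iterations, uses)
refined = exposed_use(iterations, uses, defs)
```

Here `naive` contains elements 1 through N-1, while `refined` contains only element 1, which the outer loop defines before entering L; that is why A(1:N) can be privatized in L1.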

Because the aggregated sets approximate the total effect of the inner loop, 
in the analysis of the outer loop it is safe to use the aggregated sets, to collapse 
the inner loop into one node, and to ignore the control flow structure of the 
inner loop. 
3.3 An Example

We use the following program to illustrate how the algorithm works. 



     SUBROUTINE SHUF(A,N2,N,WORK)
L1:  DO J = 1, N, 2
L2:    DO I = 1, N2
S1:      WORK(2*I-1) = A(I,J)
S2:      WORK(2*I) = A(I,J+1)
       ENDDO
L3:    DO K = 1, N2
S3:      A(K,J) = WORK(K)
S4:      A(K,J+1) = WORK(K+N2)
       ENDDO
     ENDDO



The algorithm is first called on loop L1 which, in turn, recursively calls the
algorithm on inner loops L2 and L3. For each iteration of L2, we find that
the summary sets of definitions and exposed uses for the body of L2 are:

$$DEF_b(L2) = \{WORK(2*I-1), WORK(2*I), I\} = \{WORK(2*I-1 : 2*I : 1), I\}$$

$$USE_b(L2) = \{A(I, J), A(I, J+1), J\} = \{A(I, J : J+1), J\}$$

Therefore, the aggregated sets of definitions and exposed uses for the loop L2
are:

$$DEF(L2) = \{(WORK(2*I-1 : 2*I), [I = 1, N2]), I\} = \{WORK(1 : 2*N2 : 1), I\}$$

$$USE(L2) = \{(A(I, J : J+1), [I = 1, N2]), J\} = \{A(1 : N2, J : J+1), J\}$$

For each iteration of L3, we find that the summary sets of definitions and
exposed uses for the body of L3 are:

$$DEF_b(L3) = \{A(K, J), A(K, J+1), K\} = \{A(K, J : J+1 : 1), K\}$$

$$USE_b(L3) = \{WORK(K), WORK(K+N2), J\}$$

Therefore, the aggregated sets of definitions and exposed uses for the loop L3
are:

$$DEF(L3) = \{(A(K, J : J+1 : 1), [K = 1, N2]), K\} = \{A(1 : N2 : 1, J : J+1 : 1), K\}$$

$$USE(L3) = \{(WORK(K), WORK(K+N2), [K = 1, N2]), J\} = \{WORK(1 : 2*N2), J\}$$
The body of L1 is then analyzed. All the definitions in L2 will reach the uses
in L3. That is,

$$MRD_{in}(L3) = \{J, WORK(1 : 2*N2 : 1), I\}$$

Hence, the privatizable variables in loop L1 are:

$$PRI_b(L1) = \{J, WORK(1 : 2*N2 : 1), I\}$$

The summary sets of definitions and exposed uses for loop L1 are:

$$DEF_b(L1) = \{A(1 : N2, J : J+1 : 1), WORK(1 : 2*N2 : 1), I, J, K\}$$

$$USE_b(L1) = \{A(1 : N2, J : J+1 : 1)\}$$

In this example, the summary sets for loop Ll also can be used for depen- 
dence analysis. In general, to prove that a loop L is parallel, all we have to 
prove is that the possibly exposed uses, $USE_b(L)$, and the possible definitions 
will never overlap for different values of the loop index. The set of possible 
definitions within the loop, $DEF_{mb}(L)$, can be defined in a flow-insensitive way 
as: 

$$DEF_{mb}(L) = \bigcup \{DEF(t) : t \in BODY(L)\}$$

In Polaris, the array def-use analysis module generates $DEF_{mb}(L)$ to 
provide array section information for other passes, such as reduction code 
generation and runtime dependence testing. 
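
As a small illustration of this overlap test, the Python sketch below enumerates the section A(1:N2, J:J+1) touched by each iteration of L1 in SHUF (WORK is assumed to be privatized already) and checks that sections belonging to different iterations are disjoint. The `step` parameter and the function names are assumptions made for the example; a compiler would decide this symbolically.

```python
def a_section(j, n2):
    """Elements of A used and defined by iteration J of L1: A(1:N2, J:J+1)."""
    return {(i, c) for i in range(1, n2 + 1) for c in (j, j + 1)}


def loop_is_parallel(n, n2, step=2):
    """True if no two distinct iterations of DO J = 1, N, step touch the same element of A."""
    iterations = range(1, n + 1, step)
    for j1 in iterations:
        for j2 in iterations:
            if j1 != j2 and a_section(j1, n2) & a_section(j2, n2):
                return False
    return True
```

With the stride of 2 used in SHUF the sections never overlap; with a stride of 1, consecutive iterations would share column J+1 and the test would fail.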

3.4 Profitability of Privatization 

After an array is identified as privatizable in a loop, we need to determine 
if different iterations of a loop will access the same location of the array. 
For instance, as already discussed in Section 2, A(I) is privatizable in the 
following loop: 

S1:   DO I = 1, N
S2:     A(I) = ...
S3:     ... = A(I)
S4:   ENDDO

We can privatize A(I) using a private scalar as follows: 

C$DIR INDEPENDENT
C$DIR PRIVATE X
C$DIR LAST VALUE A(I) = X
S1:   DO I = 1, N
S2:     X = ...
S3:     ... = X
Sn:   ENDDO

This transformation is useful for conventional compiler optimization. Today's 
optimizing compilers usually will not allocate a register to a subscripted 
variable A(I) in the original program due to their limited capability in handling 
array references. In the transformed program, it is easier for them to allocate 
a register to a scalar X. 

Privatizing A(I) also can reduce the amount of false sharing in multiprocessor 
caches. In a distributed memory system where the compiler uses the 
owner computes rule [ZBG88, CK88b, RP89], privatization effectively transfers 
the ownership of A(I) to the processor executing iteration I; hence, the 
processor scheduled to execute iteration I can execute operations in S2 
even if it does not own A(I). This transformation can facilitate data 
distribution, reduce communication, and improve load balance [TP93a]. 

However, for the purpose of eliminating memory-related dependence, the 
array A in the previous example need not be privatized. We say that it is 
profitable to privatize an array when different iterations of the loop access 
the same location. Whether this condition is satisfied can be determined 
by examining $PRI_b(L)$. We will call the test that examines $PRI_b(L)$ the 
profitability test. Let A(r) be a reference to array A where r is a subscript 
expression if A(r) is a subscripted variable, or a range list if A(r) is a subarray. 
We assume that r is either loop invariant or is expressed in terms of the loop 
index, i. 

If A(r) is a subscripted variable and r is a strictly monotonic function of the 
loop index i, then different iterations of i will access different locations of 
A(r); hence, it is not profitable to privatize A(r). Otherwise it is profitable. 
When there is more than one subscript of A in $PRI_b(L)$, we need to test 
each pair for multiple accesses to the same location in different iterations. 
This can be done using Banerjee's Test [Ban88] or any other dependence 
test. If A(r) is a subarray, we need to determine if there is an iteration $j \neq i$ 
such that $A(r) \cap A(r[i/j]) \neq \emptyset$, where r[i/j] represents r after we substitute 
each occurrence of loop index i with j. Again, one has to test for each pair of 
occurrences if there is more than one occurrence of subarrays. This discussion 
is summarized in the algorithm shown in Figure 3.2. 
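
The profitability condition can be sketched in Python by enumerating iterations, under the assumption that each reference is modeled as a function from an iteration number to the set of subscripts it touches. This is an illustrative brute-force check with hypothetical names, not the dependence-test-based procedure a compiler would use.

```python
def overlaps_across_iterations(x, y, iterations):
    """True if X[i/j] and Y intersect for some pair of distinct iterations i, j.

    `iterations` must be re-iterable (for example a range)."""
    for i in iterations:
        for j in iterations:
            if i != j and x(j) & y(i):
                return True
    return False


def profitable(refs, iterations):
    """refs: functions mapping an iteration to the subscripts of one array it touches."""
    return any(overlaps_across_iterations(x, y, iterations)
               for x in refs for y in refs)


def elementwise(i):
    """A(I): each iteration touches a distinct element."""
    return {i}


def whole_section(i):
    """WORK(1:8): a loop-invariant section touched by every iteration."""
    return set(range(1, 9))
```

Under these assumptions, `profitable([elementwise], range(1, 6))` is false, matching the A(I) example, while a loop-invariant section such as `whole_section` makes privatization profitable.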

3.5 Last Value Assignment 

Liveness analysis is needed to determine if a privatizable variable is live 
after exiting the loop. If it is live, the last-value assignment will be necessary 
to preserve the semantics of the original program; otherwise, no last-value 
assignment is needed for that variable. 

Definition 37. Let S be a node in the flowgraph. The live variables at the 
bottom of S are the set of variables that may be used after S completes 
execution. We define: 

1. $LVBOT(S) := \{v \in VAR : v \text{ may be used after } S\}$

2. $LVTOP(S) := \{v \in VAR : v \text{ may be used after } S \text{ or in } S\}$ □ 

Algorithm Profitability Test 

Input: $PRI_b$ for loop L with index $i \in [p : q : t]$
Output: PRO, arrays profitable for privatization 

PRO $\leftarrow \emptyset$
foreach $A(r) \in PRI_b$ do 
    $ALL_A \leftarrow \{A(r) : A(r) \in PRI_b\}$
    foreach pair $A(x), A(y) \in ALL_A$, where x and y can be the same 
        let X $\leftarrow$ set of values in x 
        let Y $\leftarrow$ set of values in y 
        if $(\exists j \in [p : q : t],\ j \neq i,\ X[i/j] \cap Y \neq \emptyset)$
            PRO $\leftarrow$ PRO + A 
        Notice that if x = y and x does not contain i, the test is satisfied. 
    endfor 
endfor 

Fig. 3.2. Profitability Test 



Let succ{S) be the set of immediate successors of S in the program flow- 
graph. The equations for LVTOP and LVBOT are: 

$$LVBOT(S) = \bigcup_{t \in succ(S)} LVTOP(t)$$

$$LVTOP(S) = (LVBOT(S) - DEF(S)) \cup USE(S)$$

The array def-use analysis computes the aggregated sets for each loop in 
the program. We can use these aggregated sets in the liveness analysis and 
ignore the control flow graph inside the loop body. For each loop L in the 
program, we can instead use the following equations to reduce the amount of 
work: 

$$LVBOT(L) = \bigcup_{t \in succ(L)} LVTOP(t)$$

$$LVTOP(L) = (LVBOT(L) - DEF(L)) \cup USE(L)$$

For each loop L, the kill set KILL(L), which is DEF(L) in the equations above, 
contains the must-written variables, and USE(L) contains the possibly exposed 
uses of loop L. They are both conservative for liveness analysis. 

The data flow equations for liveness analysis can be solved using an it- 
erative algorithm that traverses the control flow graph backwards. This is a 
natural extension of scalar live analysis to array liveness analysis. 
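
A minimal Python sketch of such an iterative solver is given below. It assumes the flow graph is supplied as dictionaries of successors, DEF sets, and USE sets keyed by node name, with each loop already collapsed to a single node; whole variable names stand in for array sections, and the three-node example graph is hypothetical.

```python
def live_variables(succ, defs, uses):
    """Solve the LVBOT/LVTOP equations by iterating to a fixed point."""
    lvtop = {n: set() for n in succ}
    lvbot = {n: set() for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            # LVBOT(n) is the union of LVTOP over the successors of n.
            bot = set().union(*(lvtop[t] for t in succ[n]))
            # LVTOP(n) = (LVBOT(n) - DEF(n)) union USE(n).
            top = (bot - defs[n]) | uses[n]
            if bot != lvbot[n] or top != lvtop[n]:
                lvbot[n], lvtop[n] = bot, top
                changed = True
    return lvbot, lvtop


# Hypothetical graph: loop L1 (as one node), a statement reading A, then exit.
SUCC = {"L1": ["after"], "after": ["exit"], "exit": []}
DEFS = {"L1": {"A", "WORK"}, "after": set(), "exit": set()}
USES = {"L1": {"A"}, "after": {"A"}, "exit": set()}
```

In this example WORK is not in LVBOT of the loop node, so no last-value assignment would be needed for it, whereas A is live after the loop.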

After live analysis, we can ignore the last-value assignments for private 
arrays that are not live at the bottom of the loop. However, the remaining live 
private arrays have to be copied to their global counterparts. Two problems 
prevent the static determination of the iteration that copies its private array 
to the global array. One, as shown earlier in Section 3, is due to conditional 
definitions. Without information about which branch the program will take 
at runtime, it is impossible to determine which iteration assigns the last 
value. Another problem is that some complicated subscript expressions make 
it inefficient to compute at compile time which iteration will assign the last 
value. In these cases, we can use well-known runtime techniques, such as that 
described in Section 2. 

Our first step is to identify the private arrays that need dynamic last-value 
assignments because of conditional definitions. $PRI_b$ contains all the array 
uses that are covered by some definition in the same iteration of the loop. 
Some of the uses are conditional; that is, they are covered by some conditional 
definitions. $DEF_b$ contains all the variables that must be defined as a function 
of the iteration number. Therefore, $PRI_{st} = PRI_b \cap DEF_b$ contains the 
privatizable arrays that are unconditionally defined. Hence $PRI_{dy} = PRI_b - PRI_{st}$ 
contains the conditionally defined privatizable arrays. 

Because of the profitability test, at least one element of each array in 
$PRI_{st}$ is defined in two or more iterations. To determine for each iteration 
what element has to be copied back to the global array, we define a write 
back set as the sections of private array that have to be copied back to the 
global array for iteration i. 

Definition 38. Let L be a loop body and $PRI_{st}$ be the static private arrays. 
The Write Back Set (WBS) of L is defined as the sections of arrays in $PRI_{st}$ 
that are written in the ith iteration, but are not written thereafter. □ 

From the definition, we can compute the WBS by comparing the set defined 
in iteration i and the set defined in the iterations after i. The algorithm 
is shown in Figure 3.3. 



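Algorithm Write Back Set 

Input: $PRI_{st}$ for loop L with index $i \in [p : q : t]$
Output: WBS 

WBS $\leftarrow \emptyset$
foreach array $A \in PRI_{st}$ do 
    $ALL_A \leftarrow \{A(r) : A(r) \in PRI_{st}\}$
    WBS $\leftarrow$ WBS $\cup\ (ALL_A - \bigcup_{j \in [i+t : q : t]} ALL_A[i/j])$
endfor 

Fig. 3.3. Compute Write Back Set 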



Note that the last iteration of loop L will always write back all its static 
private arrays. When we cannot find a closed form for WBS, we can move 
the array to $PRI_{dy}$ and use runtime resolution. Actually the algorithm itself 
can be linked into the program to perform a runtime test for each iteration. In 
most cases, the algorithm will find a closed form and, therefore, WBS can 
be determined at compile time. The following loop will be used to illustrate 
how the algorithm in Figure 3.3 works in two different situations. 



S1:   DO I = 1, N
S2:     DO J = 1, M
S3:       A(J) = ...
S4:       B(I+J) = ...
S5:     ENDDO
Sn:   ENDDO



For loop S1, $PRI_{st} = \{A(1 : M),\ B(I+1 : I+M)\}$. A(1:M) will be 
accessed in all iterations after a given I < N because A(1:M) does not depend 
on I. Hence, WBS for A in iteration I is $\emptyset$, the empty set. Only the last 
iteration of loop S1 will copy out A(1:M). For B, B(I+1 : I+M) is in $ALL_B$ for 
iteration I; B((I+1)+1 : M+N) is modified in iterations from I+1 to N. Hence, 
the WBS for B in iteration I is B(I+1). 
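
The two situations can be reproduced with a brute-force Python sketch of the write back set for concrete M and N. The values of M and N and the helper names are assumptions for illustration; the algorithm in Figure 3.3 works on symbolic sections rather than enumerated elements.

```python
def write_back_set(section, i, last, step=1):
    """Elements written in iteration i that no later iteration writes.

    section(k) returns the set of subscripts written in iteration k."""
    later = set()
    for j in range(i + step, last + 1, step):
        later |= section(j)
    return section(i) - later


M, N = 3, 5          # assumed loop bounds for the example


def a_section(i):
    """A(1:M), independent of the outer index I."""
    return set(range(1, M + 1))


def b_section(i):
    """B(I+1 : I+M)."""
    return set(range(i + 1, i + M + 1))
```

For an interior iteration the write back set of A is empty and that of B is the single element B(I+1); in the last iteration both arrays are written back in full.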



4. Demand-Driven Symbolic Analysis 

To evaluate its effectiveness, the privatization algorithm just described was 
implemented in Polaris. We now present a comparison of the number of pri- 
vate arrays found by the algorithm with the number of private arrays found 
by hand in the Perfect Benchmarks as reported in [EHLP91]. The result is 
shown in Table 4.1. The first column reports the number of private arrays 
identified by both manual and automatic privatization. The second column re- 
ports the number of private arrays identified by manual privatization but not 
by automatic privatization. The third column reports the number identified 
by automatic privatization but not by manual privatization. By comparing 
the results of automatic privatization and manual privatization, it is clear 
that the algorithm is sufficient to discover most of the privatizable arrays. 
The representation for subarray sections is also adequate for representing the 
array use and definition in the programs of the Perfect Benchmarks. The 
algorithm can successfully handle all the privatizable arrays in FLO52 and 
TRFD. In the programs BDNA, DYFESM, FLO52, and MDG, the algorithm 
finds some privatizable arrays that are not found by manual array privatiza- 
tion. One reason for this is that finding privatizable arrays is tedious work 
that requires a lot of effort. A compiler is more reliable and consistent in 
handling the mechanical part of the task. Another reason is that in the 
manual privatization, the programmers used runtime profile information to select 
target loops for potential parallelization and ignored the privatizable arrays 
in the other loops. 

Program         Automatic     Manual    Automatic
                and Manual    Only      Only
ADM (AP)             2          12          0
ARC2D (SR)           0           2          0
BDNA (NA)           12           3          4
DYFESM (SD)          0           1         11
FLO52 (TF)           0           0          4
MDG (LW)            17           1          1
MG3D (SM)            1           4          0
OCEAN (OC)           4           3          0
QCD (LG)            22           7          0
SPEC77 (WS)         25          14          0
TRACK (MT)          20           2          0
TRFD (TI)            4           0          0

Table 4.1. Number of Private Arrays 

However, there were still many privatizable arrays that the privatization 
algorithm failed to identify. We found that in most instances where our algo- 
rithm failed, it was due to lack of information about symbolic variables. To 
increase the coverage of the algorithms, it seems necessary to use more so- 
phisticated techniques for determining the equivalence of symbolic variables, 
interprocedural symbolic values and bounds propagation, and conditional 
data flow analysis. 

To illustrate the need for more sophisticated techniques, consider the fol- 
lowing program segment: 



      IF (P) THEN JLOW = 2, JUP = JMAX - 1
      ELSE JLOW = 1, JUP = JMAX
L:    DO K = 1, N
        WORK(JLOW:JUP) = ...
        IF (P) THEN ... = WORK(2:JMAX-1)
      ENDDO

Loop L cannot be executed in parallel because each of its iterations reads and 
writes the same elements of array WORK. The array privatization algorithm 
tries to determine if it is correct to allocate a private copy of array WORK 
to each iteration of the loop. This transformation is safe when no iteration 
uses any data in the array WORK that is computed by other iterations. The 
WORK array is privatizable in the loop if we can prove that the section read, 
WORK(2:JMAX-1), is covered by the section written, WORK(JLOW:JUP). That 
is, we need to prove that JLOW is less than or equal to 2 and that JUP is 
greater than or equal to JMAX-1. Later in this chapter, we will present two 
ways to prove the above symbolic relations. 

In this section, we present a demand-driven technique to propagate and 
analyze symbolic expressions as well as a technique to perform conditional 
flow analysis. These techniques are also useful in areas other than array pri- 
vatization. In fact, recent developments in parallelizing compilers have re- 
sulted in the increased use of the symbolic analysis technique to facilitate 
parallelism detection and program transformation. Several research compil- 
ers, such as Parafrase-2 [HP92] and Nascent [GSW], use symbolic analysis to 
identify and transform induction variables. In the Polaris [BEF+94] restruc- 
turing compiler, symbolic analysis is used, in addition to array privatization, 
for dependence analysis and symbolic range propagation. 

4.1 Gated Single Assignment 

The symbolic analysis techniques discussed below are based on an extension 
of the Static Single Assignment (SSA) form [CFR+91] of program representa- 
tion, known as the Gated Single Assignment (GSA) form [BMO90]. In SSA, 
each definition of a variable is given a unique name. Each use of a renamed 
variable can only refer to a single reaching definition. Where several definitions 
of a variable, $x_1, x_2, \ldots, x_m$, reach a confluence point in the CFG of 
the program, a $\phi$ function assignment statement, $x_n \leftarrow \phi(x_1, x_2, \ldots, x_m)$, is 
inserted to merge them into a new variable definition $x_n$. The condition under 
which a definition reaches a join node is not represented in the $\phi$-function. 
In the GSA representation, several types of gating functions are defined to 
represent the different types of conditions at different join nodes. Some extra 
parameters are introduced in the gating functions to represent the conditions. 
GSA introduces three new gating functions: 

- $\gamma$ function: Replaces those $\phi$-functions at join nodes associated with IF 
  statements. A $\gamma$ function includes the predicate of an IF statement as an 
  additional argument. 

- $\mu$ function: Replaces $\phi$ functions at the head of a loop. It also includes an 
  extra argument to represent the loop header's control condition. 

- $\eta$ function: Replaces $\phi$ functions at the exit of a loop. It selects the last 
  value produced by a $\mu$ function. 

Several algorithms exist for converting programs into GSA form [BMO90, 
Hav93,TP94]. Using the GSA representation, we can represent the value of a 
symbolic variable in an expression containing other variables, constants, and 
gating functions. 

Traditionally, symbolic expressions for each variable have been con- 
structed by applying forward substitution during symbolic execution of the 
program. To apply forward substitution, the compiler follows the def-use 
chains and substitutes each use of a variable with the symbolic expression 
that was assigned to the variable. The symbolic value of a variable is usually 
represented as a function of program inputs. When multiple definitions reach 
the same use of a variable, the variable will have multiple possible symbolic 
values. A compiler's ability to represent multiple possible symbolic values 
determines the types of analysis the compiler can perform. In the GSA 
representation, multiple possible reaching definitions and the conditions 
guarding the definitions are represented in the arguments of gating functions. Gating 
functions provide a natural way to represent an expression's multiple possible 
symbolic values under different conditions. 



4.2 Demand-Driven Backward Substitution 

Another advantage of using a single assignment form is that the use-def 
chains are embedded in the unique variable names. This representation of 
use-def chains provides an opportunity to perform demand-driven analysis. 
Demand-driven analysis is desirable for analysis that arises sparsely. In a 
parallelizing compiler, the requirement of symbolic analysis is sparse. Many 
simple programs or parts of a program do not need symbolic analysis. Full- 
scale forward substitution is expensive. Much of the information generated 
and propagated by forward substitution is never used. Furthermore, in many 
cases, representing everything in terms of program inputs is unnecessary, as 
illustrated next. Consider the following code segment: 



R:    JMAX = Expr
S:    IF (P) THEN J = JMAX - 1
      ELSE J = JMAX
T:    assert(J $\le$ JMAX)

To determine whether the assertion $(J \le JMAX)$ is true at T, we need 
to know the symbolic value of J. Forward substitution starts at statement R. 
Once it completes, J and JMAX at statement T are replaced by 
(if P then Expr - 1 else Expr) and Expr, respectively. Thus, the boolean 
expression $(J \le JMAX)$ evaluates to true. It is easy to see that the 
substitution of JMAX by Expr is unnecessary. In a large program, forward 
substitution could produce long and complex expressions. Therefore, determining 
whether an assertion is true could be very time consuming. Approximate 
summary information could be used to improve the efficiency of this process. 
However, in general, approximation decreases the accuracy of analysis. 

In a demand-driven approach, we seek information only when it is needed. 
Instead of propagating all symbolic values forward, our demand-driven strat- 
egy is goal directed and moves backward. Given a symbolic expression, we 
backward substitute arguments in the expression. The backward substitution 
stops when enough information to satisfy a specific objective has been 
obtained. In a forward substitution strategy, the requirements are not known. 
Therefore, it is very difficult, or impossible, to determine which subset of the 
available information to propagate and where to start the propagation. 

As an example, consider the following GSA representation of the last code 
segment: 



R:    $JMAX_1$ = Expr
S:    IF (P) THEN $J_1 = JMAX_1 - 1$
      ELSE $J_2 = JMAX_1$
S':   $J_3 = \gamma(P, J_1, J_2)$
T:    assert($J_3 \le JMAX_1$)

A demand-driven analysis starts at T and performs backward substitution 
following the SSA links of the variables in the expression. The intermediate 
statements, which do not affect the variables used in T, are skipped. The 
steps of the substitution are: 



$$J_3 = \gamma(P, J_1, J_2) = \gamma(P, JMAX_1 - 1, JMAX_1)$$

The backward substitution stops at this point because enough information 
for proving $J_3 \le JMAX_1$ has been obtained. The redundant substitution of 
$JMAX_1$ by Expr is avoided. 
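
The goal-directed nature of this process can be sketched in Python. The sketch assumes a deliberately tiny expression language (an opaque definition, a variable plus a constant, and a $\gamma$ function), stores the GSA definitions in a dictionary, and records which names were actually substituted; the encoding and all names are hypothetical.

```python
# GSA definitions of the example: name -> (kind, arguments...).
DEFS = {
    "JMAX1": ("opaque", "Expr"),
    "J1": ("offset", "JMAX1", -1),      # J1 = JMAX1 - 1
    "J2": ("offset", "JMAX1", 0),       # J2 = JMAX1
    "J3": ("gamma", "P", "J1", "J2"),   # J3 = gamma(P, J1, J2)
}


def possible_values(name, bound, defs, visited):
    """Backward-substitute `name` only until its leaves are expressed via `bound`."""
    if name == bound:
        return [(bound, 0)]              # goal reached: stop substituting
    visited.append(name)
    kind, *args = defs[name]
    if kind == "offset":                 # name = base + const
        base, const = args
        return [(b, c + const)
                for b, c in possible_values(base, bound, defs, visited)]
    if kind == "gamma":                  # name = gamma(pred, v_true, v_false)
        _, v_true, v_false = args
        return (possible_values(v_true, bound, defs, visited)
                + possible_values(v_false, bound, defs, visited))
    return [(name, 0)]                   # opaque definition: cannot expand further


def prove_le(name, bound, defs):
    """Try to prove name <= bound; also return the names that were substituted."""
    visited = []
    values = possible_values(name, bound, defs, visited)
    proved = all(b == bound and c <= 0 for b, c in values)
    return proved, visited
```

Calling `prove_le("J3", "JMAX1", DEFS)` succeeds after substituting only J3, J1, and J2; the definition of JMAX1 is never expanded, mirroring the avoided substitution described above.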

During backward substitution, if there are several arguments that can be 
back-substituted, we need to decide which argument should be substituted 
first. Our strategy is as follows. Given two arguments, u and v, that can 
be back-substituted, if the assignment statement of u ($S_u$) dominates the 
assignment statement of v ($S_v$) in the program control flow graph (CFG), then 
v should be back-substituted before u. Because $S_u$ dominates $S_v$, the value 
of u cannot depend on the value of v, but the value of v may depend on the 
value of u. When there are loops in the program, this order may not be valid 
since u can potentially depend on v through some back edges. In the case 
of loops, we use a known technique to identify the set of variables that belong 
to the same strongly connected region (SCR). If $S_u$ and $S_v$ belong to different 
SCRs, the order is then determined by the dominance relation between the 
SCRs. 

When comparing two partially back-substituted symbolic expressions, s 
and t, we can use the dominator tree to determine which argument in s or t 
should be substituted first. If the arguments are substituted from the bottom- 
up in the dominator tree, then it is possible to avoid expanding an expression 
beyond what is necessary for the comparison. This simple algorithm is shown 
in Figure 4.1. 

For instance, in the last code segment, when comparing $J_3$ with $JMAX_1$, 
the assignment statement R dominates the assignment statements S, S', and 
T. Hence, $J_3$ is substituted first. $J_1$ and $J_2$ are substituted next. The 
substituted expression $\gamma(P, JMAX_1 - 1, JMAX_1)$ is comparable with $JMAX_1$. 

Algorithm Unify 
Input: expressions s and t 

1. Mark constants and matching arguments in s, t as dead. 

2. while $\exists$ active arguments $\in$ s, t do 
       Substitute an argument whose assignment is the lowest in the 
       dominator tree. 
       Mark constants and matching arguments in the resulting expressions 
       as dead. 

Fig. 4.1. Ordering Backward Substitution 



Backward substitution stops after it is determined that the expressions in- 
volved are comparable. 

4.3 Backward Substitution in the Presence of Gating Functions 

To derive the value of a variable at a point S in a program, we first perform 
backward substitution to obtain a symbolic expression. The resulting symbolic 
expression SE contains: literals, such as constants or variables; normal 
functions and operators, such as +, abs, ...; and the gating functions $\eta$, 
$\gamma$ and $\mu$. During backward substitution, gating functions are treated in the 
same way as normal functions. The gating functions in a symbolic expression 
represent the different possible values of the expression under different condi- 
tions. Hence, we will call an expression a predicated expression if it contains 
gating functions. 

The purpose of demand-driven symbolic analysis is to determine the val- 
ues of a variable at a specific point of the program CFG. A predicated expres- 
sion captures the conditional possible values of the expression under different 
conditions. The conditional possible values of a predicated expression can be 
refined if the necessary conditions for the control flow to reach the point of 
interest in the program CFG are taken into account. 

Definition 41. A symbolic path condition PC is a predicate specifying the 
control flow conditions under which the program flow will reach the statement S. 



For instance, in the following program: 



      IF (P) THEN $JUP_0 = JMAX - 1$
      ELSE $JUP_1 = JMAX$
      $JUP_2 = \gamma(P, JUP_0, JUP_1)$
      IF (P) THEN
S:      ... = $JUP_2$

the path condition that controls the execution of statement S is (P). This 
path condition can be used to refine the possible values of $JUP_2$ at S. After 
backward substitution, the predicated expression of $JUP_2$ is 
$\gamma(P, JMAX - 1, JMAX)$. That is, $JUP_2$ has two possible values, $JMAX - 1$ and $JMAX$, 
depending on the value of P. Because the path condition at S is (P = true), 
the value of $JUP_2$ at S can be determined to be $JMAX - 1$. 

To compute the path condition for each statement, we need to use the 
concept of iterative control dependences [FOW87]. 

Definition 42. The iterative control dependences of a node X is the transi- 
tive closure of its control dependences. 

Hence, if X is control dependent on Y, and Y is control dependent on Z, 
then X is iteratively control dependent on Z. The iterative control depen- 
dences of a CFG node S specify the branch nodes that determine whether 
the control flow will reach S. 

If S is iteratively control dependent on a collection C of branch statements, 
its path condition PC can be represented as a boolean expression 
that contains the branch conditions in C. The path-restricted value PV of a 
symbolic expression SE at statement S is the projection of the possible values of 
SE onto PC: 

$$PV = SE(PC)$$

To compute the projection, we can use the following rules: 

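$$SE(PC) = SE \quad \text{if } SE \text{ contains no gating functions}$$

$$\gamma(P, V_t, V_f)(PC) = \begin{cases} V_t(PC) & \text{if } PC \Rightarrow P \\ V_f(PC) & \text{if } PC \Rightarrow \neg P \\ \gamma(P, V_t(PC), V_f(PC)) & \text{otherwise (unknown)} \end{cases}$$

$$\mu(L, V_{init}, V_{iter})(PC) = \mu(L, V_{init}(PC), V_{iter}(PC))$$

$$\eta(P, V)(PC) = V(P \wedge PC)$$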
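
The rule for $\gamma$ functions can be sketched in Python as follows. The sketch assumes that a path condition is a conjunction of predicate literals, stored as a dictionary from predicate name to the truth value the path implies, and that a predicated expression is either a leaf or a tuple tagged "gamma"; the $\mu$ and $\eta$ rules are omitted and the representation is hypothetical.

```python
def project(expr, pc):
    """Project a predicated expression onto a path condition.

    expr: a leaf (str or int) or ("gamma", pred, v_true, v_false).
    pc:   dict mapping predicate names to the truth value implied by the path."""
    if not isinstance(expr, tuple):
        return expr                      # no gating function: unchanged
    _, pred, v_true, v_false = expr
    if pc.get(pred) is True:             # PC implies pred
        return project(v_true, pc)
    if pc.get(pred) is False:            # PC implies not pred
        return project(v_false, pc)
    # Unknown: keep the gamma and project both arguments.
    return ("gamma", pred, project(v_true, pc), project(v_false, pc))


JUP2 = ("gamma", "P", "JMAX-1", "JMAX")
```

With the path condition P, `project(JUP2, {"P": True})` yields the leaf "JMAX-1", as in the example of statement S above; with an empty path condition the expression is returned unchanged.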

4.4 Examples of Backward Substitution 

We next present some examples of backward substitution and path projec- 
tion. The examples illustrate how these techniques improve the effectiveness 
of array privatization. The techniques also are useful to improve the accuracy 
of dependence analysis [BE94a, BE94b] . 

Consider the following code segment: 

      DIMENSION XE(10000)
S:    $NDFE_0 = NDDE_0 * NNPED_0$
D:    DO i = 1, $NDFE_0$
        XE(i) = ...
      ENDDO
U:    DO i = 1, $NDDE_0$
        DO j = 1, $NNPED_0$
          ... = XE((i - 1) * $NNPED_0$ + j)
        ENDDO
      ENDDO

To prove that the array section $XE(1 : NDFE_0)$ defined in loop D covers 
the array section $XE(1 : NDDE_0 * NNPED_0)$ accessed in loop U, we need 
to prove that $NDFE_0 \ge NDDE_0 * NNPED_0$. This is easily done after 
$NDFE_0$ is replaced by $NDDE_0 * NNPED_0$ using backward substitution. 
The path condition for those points within loop U where XE is accessed is 
$PC_U = (NDDE_0 \ge 1 \wedge NNPED_0 \ge 1)$. The path condition at those points 
where XE is defined is $PC_D = (NDFE_0 \ge 1)$ or, after backward substitution, 
$PC_D = (NDDE_0 * NNPED_0 \ge 1)$. It is easy to see that $PC_U \Rightarrow PC_D$ and, 
therefore, whenever loop U has a non-zero trip count, loop D also has a 
non-zero trip count. 
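
For concrete bounds the coverage claim can be confirmed by enumeration. The Python sketch below is only a sanity check with hypothetical function names; the analysis itself establishes the relation symbolically through backward substitution.

```python
def used_subscripts(ndde, nnped):
    """Subscripts of XE read in loop U: (i - 1) * NNPED + j."""
    return {(i - 1) * nnped + j
            for i in range(1, ndde + 1)
            for j in range(1, nnped + 1)}


def defined_subscripts(ndfe):
    """Subscripts of XE written in loop D: 1 .. NDFE."""
    return set(range(1, ndfe + 1))


def covered(ndde, nnped):
    # NDFE is the product of the two bounds, as assigned at statement S.
    return used_subscripts(ndde, nnped) <= defined_subscripts(ndde * nnped)
```

The subscripts read in U are exactly 1 through the product of the two loop bounds, so the section written in D covers them.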

We now illustrate the use of the projection rule for $\gamma$ functions. Backward 
substitution involving the $\mu$ function and associated recurrences will be 
discussed in the next section. Consider the following code segment: 



      IF (P) THEN $JLOW_0 = 2$, $JUP_0 = JMAX - 1$
      ELSE $JLOW_1 = 1$, $JUP_1 = JMAX$
      $JLOW_2 = \gamma(P, JLOW_0, JLOW_1)$
      $JUP_2 = \gamma(P, JUP_0, JUP_1)$
L:    DO ...
D:      assign to array section $WORK(JLOW_2 : JUP_2)$
U:      IF (P) THEN use array section $WORK(2 : JMAX - 1)$
      ENDDO

For the array WORK to be private to the loop L, we need to determine 
that the use of $WORK(2 : JMAX - 1)$ at U is covered by the definition of 
$WORK(JLOW_2 : JUP_2)$ at D. The PC at U is P. Using the projection rule for 
the $\gamma$ function under the condition P, we obtain the following values which 
prove the desired coverage: 

$$JLOW_2 = \gamma(P, JLOW_0, JLOW_1)(P) = JLOW_0(P) = 2$$

$$JUP_2 = \gamma(P, JUP_0, JUP_1)(P) = JUP_0(P) = JMAX - 1$$

4.5 Bounds of Symbolic Expression 

Sometimes it is sufficient to know the possible values of a variable with- 
out looking at the predicates of the gating functions. In array privatization, 
knowing the upper and lower bounds of a variable often can prove the desired 
property. 

A predicated expression contains the possible symbolic values and the 
conditions for the expression assuming the values. If we ignore the predicates 
in the gating functions, the rest of the expression represents only the possible 
values of the expression. To estimate the bounds of a symbolic expression, 
we often can ignore the predicates that are difficult to resolve with path 
projection and, instead, apply the minimum and maximum functions directly 
to the possible values: 

$$\max(\gamma(P, V_t, V_f)) \le \max(V_t, V_f)$$

$$\min(\gamma(P, V_t, V_f)) \ge \min(V_t, V_f)$$

Using these two rules, we can obtain the following for the last code seg- 
ment: 



$$\max(JLOW_2) \le \max(\gamma(P, JLOW_0, JLOW_1)) = \max(JLOW_0, JLOW_1) = \max(2, 1) = 2$$

$$\min(JUP_2) \ge \min(\gamma(P, JUP_0, JUP_1)) = \min(JUP_0, JUP_1) = \min(JMAX - 1, JMAX) = JMAX - 1$$

This result also proves the coverage property needed for privatization. 
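
These bound rules can be sketched in Python by collecting the leaves of a predicated expression and ignoring its predicates. The sketch assumes each leaf is a pair (symbol, constant) meaning symbol + constant, with symbol None for pure constants, and it gives up when the leaves do not share a symbolic part; this encoding is an assumption made for the example.

```python
def leaves(expr):
    """All possible values of a predicated expression, ignoring predicates."""
    if isinstance(expr, tuple) and expr[0] == "gamma":
        _, _pred, v_true, v_false = expr
        return leaves(v_true) + leaves(v_false)
    return [expr]


def bound(expr, pick):
    """pick is max or min; returns the bounding leaf, or None if not comparable."""
    values = leaves(expr)
    symbols = {sym for sym, _ in values}
    if len(symbols) != 1:
        return None                      # leaves differ in their symbolic part
    return pick(values, key=lambda v: v[1])


JLOW2 = ("gamma", "P", (None, 2), (None, 1))
JUP2 = ("gamma", "P", ("JMAX", -1), ("JMAX", 0))
```

Here `bound(JLOW2, max)` gives the constant 2 and `bound(JUP2, min)` gives JMAX - 1, the two facts needed to show that WORK(2 : JMAX-1) is covered.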

4.6 Comparison of Symbolic Expressions 

The symbolic expression may still contain $\gamma$ functions after path 
projection. In symbolic analysis, we sometimes need to compare these expressions. 
Alpern, Wegman, and Zadeck [AWZ88] define a congruence relation between 
expressions containing $\phi$ assignments. 

Definition 43. Two expressions are congruent if and only if: 

— They have the same gating functions, and 

— The arguments of the gating functions are congruent. 

The congruent variables are shown to have equivalent values at a node p 
if both of them dominate p. To determine if two expressions are congruent, 
we need to transform the expressions into some sort of canonical form. In 
many cases, pattern matching alone will not be sufficient. Rewriting 
transformations, such as constant folding, arithmetic simplification and 
normalization, and value numbering [ASU86], are standard techniques to normalize 
the expressions into a canonical form. For example, (2 + 5) is folded to 7; 
(X + X + Y + Y) is simplified to (2*X + 2*Y); and (2*Y + 2*X) is rewritten 
to (2*X + 2*Y). 
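
A minimal Python sketch of such a normalization for linear expressions is shown below. It assumes an expression is given as a list of (coefficient, symbol) terms, with symbol None for constants, and it combines like terms and orders them so that equal expressions get identical canonical forms; real compilers use richer representations.

```python
from collections import Counter


def canonical(terms):
    """Combine like terms and sort them: constants first, then symbols by name.

    terms: iterable of (coefficient, symbol) pairs, symbol None for constants.
    Returns a tuple of (symbol, coefficient) pairs."""
    total = Counter()
    for coeff, sym in terms:
        total[sym] += coeff              # folding and simplification
    nonzero = ((s, c) for s, c in total.items() if c != 0)
    return tuple(sorted(nonzero,
                        key=lambda item: (item[0] is not None, item[0] or "")))
```

Two expressions can then be tested for equality by comparing their canonical tuples, which is insensitive to the order in which the terms were written.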

Congruence relation can be used only to determine equality; it cannot de- 
termine, for example, if one expression is always larger than another. We next 
define a class of expressions whose inequality relationship can be determined 
at compile-time. 

We loosely call two expressions compatible if, after backward substitution 
and normalization, the non-constant literals in one expression are a subset of 
the terms in the other. Only when two expressions are compatible can their 
relationship be determined symbolically. Two congruent expressions are also 
compatible. For the purpose of comparison, two compatible expressions, 
$E^1$ and $E^2$, can be classified as follows: 

1. $E^1$ and $E^2$ contain no gating functions: The expressions can be 
   compared by simplifying $E^1 - E^2$ symbolically. Their relationship can be 
   determined if the result is a constant. 

2. Only one of $E^1$ and $E^2$ contains gating functions: The comparison 
   will be based on the arguments of the $\gamma$ function. To illustrate the 
   method, assume that $E^2 = \gamma(P, V_t, V_f)$, where $V_t$ and $V_f$ may also 
   contain gating functions. To determine whether $E^1 > E^2$, we reduce it to 
   case 1 using the following necessary and sufficient condition: 

   $$(E^1 > \gamma(P, V_t, V_f)) \equiv (E^1 > V_t) \wedge (E^1 > V_f)$$

   If $V_t$ or $V_f$ contains any gating function, the same procedure should be used 
   recursively on the right-hand side of the above equation. The result then 
   can be evaluated as in case 1. An equivalent approach is to compute 
   the minimum and maximum values for $E^2$ using the technique discussed 
   above. Because $E^1$ does not contain any gating function, it can be proven 
   that: 

   $$(E^1 > E^2) \equiv (E^1 > \max(E^2))$$

3. Both $E^1$ and $E^2$ contain gating functions: There are several ways to 
   handle this case. We will illustrate just one here. Assume again that 
   $E^2 = \gamma(P, V_t, V_f)$. To prove $E^1 > E^2$, the necessary and sufficient condition is: 

   $$(E^1 > \gamma(P, V_t, V_f)) \equiv (E^1(P) > V_t) \wedge (E^1(\neg P) > V_f)$$

   Because $E^1$ contains a gating function, in the above equation the path 
   projections $E^1(P)$ and $E^1(\neg P)$ are necessary to cast the branching 
   conditions to $E^1$. For instance, if $E^1 = \gamma(P, V'_t, V'_f)$, the condition can be 
   evaluated as follows: 

   $$(\gamma(P, V'_t, V'_f)(P) > V_t) \wedge (\gamma(P, V'_t, V'_f)(\neg P) > V_f) = (V'_t > V_t) \wedge (V'_f > V_f)$$

   Each application of this rule eliminates one gating function. It is 
   applied recursively until the right-hand side is free of gating functions. The 
   problem is then reduced to case 2. 

We will call these three rules distribution rules. Rule 3 subsumes rule 2 
because path projection has no effect on an expression that does not contain 
any gating function. Note that for determining equalities, techniques based 
on structural isomorphism cannot identify equalities when the order of the $\gamma$ 
functions in two expressions is different. Using the distribution rules, we can 
identify those equalities. 

The distribution rules discussed also apply to more complex expressions.
Because the gating functions capture only the predicates for conditional
branches, regular functions can be safely distributed into arguments of the
gating functions. The rule is as follows:

$F(G(P, X, Y), Z) = G(P, F(X, Z), F(Y, Z))$

where $F$ is a regular function and $G$ is a gating function. For example,
$\gamma(P, X, Y)^2 + expr = \gamma(P, X^2 + expr, Y^2 + expr)$.

Using this rule, we can move a gating function to the outermost position
of an expression. For example, consider $E^2 = \gamma(P, V_t, V_f) + exp$. This
expression can be transformed to $\gamma(P, V_t + exp(P), V_f + exp(\neg P))$. In the analysis
algorithm, this transformation is deferred until distribution has been applied
to the $\gamma$ function in order to allow the common components in $E^1$ to be
cancelled out with $\gamma(P, V_t, V_f)$ or $exp$.
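
As a rough illustration of moving a gating function outward, the sketch below pushes a regular binary function into the arguments of a gated expression. It assumes integer leaves and a second operand that contains no gating function (so the projections $exp(P)$ and $exp(\neg P)$ are not modeled); `Gamma` and `distribute` are hypothetical names.

```python
from dataclasses import dataclass
import operator


@dataclass(frozen=True)
class Gamma:
    """gamma(pred, if_true, if_false): the value selected by predicate `pred`."""
    pred: str
    if_true: object
    if_false: object


def distribute(f, x, z):
    """Compute F(x, z), pushing the regular function F into any gamma in x."""
    if isinstance(x, Gamma):
        return Gamma(x.pred,
                     distribute(f, x.if_true, z),
                     distribute(f, x.if_false, z))
    return f(x, z)


# gamma(P, 9, 10) + 5 becomes gamma(P, 14, 15).
shifted = distribute(operator.add, Gamma("P", 9, 10), 5)

# gamma(P, X, Y)^2 + expr becomes gamma(P, X^2 + expr, Y^2 + expr).
squared = distribute(lambda v, e: v * v + e, Gamma("P", 2, 3), 1)
```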

The following are some examples of how to use these rules. We used rule 
2 in Section 4.2: 



$J_3 = \gamma(P, JMAX_1 - 1, JMAX_1) \le JMAX_1$

In the following example, we use rule 3 to derive the condition for
$JUP_2 > JLOW_2$ under $P$ in the last example of Section 4.5:

$(JUP_2 > JLOW_2) \iff \gamma(P, JMAX - 1, JMAX) > \gamma(P, 2, 1)$

$\iff \gamma(P, JMAX - 1 > 2, JMAX > 1)$

$\iff \gamma(P, JMAX > 3, JMAX > 1)$

When two predicated expressions contain many predicated conditional
values, using distribution rule 3 can cause the expression size to grow rapidly.
In our experience, restricting the nesting levels of gating functions to 2 seems 
to be sufficient for most analysis. For levels larger than 2, the expressions usu- 
ally are not comparable. For those complicated expressions, we can compute 
the bounds and ignore the predicates by using the following approximation 
rule: 

$(\min(E^1) > \max(E^2)) \Rightarrow (E^1 > E^2)$
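
The approximation rule can be sketched as follows, again assuming integer leaves and hypothetical names (`Gamma`, `bounds`, `surely_greater`). The bounds ignore the predicates, so the test is sufficient but not necessary: it can fail to prove a comparison that exact distribution would prove.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gamma:
    """gamma(pred, if_true, if_false): the value selected by predicate `pred`."""
    pred: str
    if_true: object
    if_false: object


def bounds(e):
    """Return (min, max) of `e` over all branches, ignoring the predicates."""
    if isinstance(e, Gamma):
        lo_t, hi_t = bounds(e.if_true)
        lo_f, hi_f = bounds(e.if_false)
        return min(lo_t, lo_f), max(hi_t, hi_f)
    return e, e


def surely_greater(e1, e2):
    """Sufficient test: min(e1) > max(e2) implies e1 > e2 on every path."""
    return bounds(e1)[0] > bounds(e2)[1]


deep1 = Gamma("P", Gamma("Q", 7, 9), 8)
deep2 = Gamma("R", 3, Gamma("P", 5, 6))
proved = surely_greater(deep1, deep2)  # min 7 > max 6

# Path by path 3 > 2 and 10 > 9 both hold, but min 3 > max 9 does not.
missed = surely_greater(Gamma("P", 3, 10), Gamma("P", 2, 9))
```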

4.7 Recurrence and the $\mu$ Function

Backward substitution of an expression involving a $\mu$ function will form a
recurrence because of the back edge in a loop. The value carried from the
previous iteration of a loop is placed in the third argument of a $\mu$ function.
Hence, the rule for the substitution of a $\mu$ function is to substitute the third
argument of the $\mu$ function until it becomes an expression of the variable
itself or an expression of another $\mu$-assigned variable. This is illustrated in
the following example:



L:  DO I = 1, N
      $J_1 = \mu((I = 1, N), J_0, J_3)$
      IF (P) THEN $J_2 = J_1 + A$
      $J_3 = \gamma(P, J_2, J_1)$
    ENDDO

Substituting the third argument in the $\mu$ function leads to:

$J_1 = \mu((I = 1, N), J_0, J_3)$

$= \mu((I = 1, N), J_0, \gamma(P, J_2, J_1))$

$= \mu((I = 1, N), J_0, \gamma(P, J_1 + A, J_1))$

The recurrence can be interpreted as the following $\lambda$-function over the loop
index.

$\lambda i.(J_1(i)) \equiv \gamma(i = 1, J_0, \gamma(P, J_1(i - 1) + A, J_1(i - 1)))$

This $\lambda$-function can be interpreted in the terms of a recursive sequence in
combinatorial mathematics as follows:

$J_1(1) = J_0$

$J_1(i) = \begin{cases} J_1(i - 1) + A & \text{if } (P) \\ J_1(i - 1) & \text{otherwise} \end{cases}$  for $i \in [2:N]$



After the mathematical form of a recurrence is identified, standard methods
for solving a linear recurrence can be used to find a closed form. For instance,
if $P$ is always true and $A$ is loop invariant, then $J_1(i)$ is an induction variable
with a closed form of $J_0 + (i - 1) * A$. When $P$ is always true and $A$ is a linear
reference to an array (e.g., $A = X(i)$), then $J_1(i)$ is a sum reduction over $X$.
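
A small simulation makes the two special cases concrete. It is a sketch only: `j1_values` is a hypothetical helper that replays the example loop and records the value the $\mu$ function delivers at the top of each iteration, with the increment and the predicate supplied per iteration.

```python
def j1_values(j0, increments, preds):
    """Replay the loop and return J1 as seen at the top of iterations 1..N.

    increments[i-1] is the value of A and preds[i-1] the value of P
    in iteration i.
    """
    values = []
    j = j0
    for a, p in zip(increments, preds):
        values.append(j)  # J1 = mu((I = 1, N), J0, J3)
        if p:             # J3 = gamma(P, J1 + A, J1)
            j = j + a
    return values


n, j0, a = 5, 10, 3

# P always true, A loop invariant: induction variable J0 + (i - 1) * A.
induction = j1_values(j0, [a] * n, [True] * n)
closed_form = [j0 + (i - 1) * a for i in range(1, n + 1)]

# P always true, A = X(i): J1(i) is a running sum over X.
x = [4, 1, 7, 2]
reduction = j1_values(0, x, [True] * len(x))
prefix_sums = [sum(x[:i - 1]) for i in range(1, len(x) + 1)]
```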

Wolfe and others [Wol92] developed a comprehensive technique to clas- 
sify and solve recurrence sequences using a graph representation of SSA. In 
this graph representation, a recurrence is characterized by a Strongly Con- 
nected Region (SCR) of use-def chains. The backward substitution technique 
for $\mu$ functions is equivalent to Wolfe’s SCR approach. Their technique for
computing the closed form can be directly used in our scheme. However, we 
believe that our backward substitution scheme, which works directly with the 
algebra of names, functions, and expressions, is better than their approach 
that works indirectly through graphs and edges. We also can deal with the
cases where no closed form expression can be obtained. For instance, if the
condition $P$ is loop invariant, symbolic substitution can determine that the
value of $J_1(i)$ is either $J_0 + (i - 1) * A$ or $J_0$.



4.8 Bounds of Monotonic Variables 

Because of conditional branches, a variable can be conditionally incremented 
by different amounts in different iterations of a loop. These variables cannot 
be represented in a non-recursive form. However, knowing that a variable is 
monotonically increasing or decreasing is sometimes useful for dependence 
analysis. 

When the closed form of a recurrence cannot be determined, it still may
be possible to compute a bound by selecting the $\gamma$ argument that has the
maximal or minimal increment to the recurrence. In the last example



L:  DO I = 1, N
      $J_1 = \mu((I = 1, N), J_0, J_3)$
      IF (P) THEN $J_2 = J_1 + A$
      $J_3 = \gamma(P, J_2, J_1)$
    ENDDO

the bounds for $J_1$ can be obtained as follows:

$\max(J_1) = \max(\mu((I = 1, N), J_0, \gamma(P, J_1 + A, J_1)))$

$= \mu((I = 1, N), J_0, \max(\gamma(P, J_1 + A, J_1)))$

$= \mu((I = 1, N), J_0, J_1 + A)$

$\min(J_1) = \min(\mu((I = 1, N), J_0, \gamma(P, J_1 + A, J_1)))$

$= \mu((I = 1, N), J_0, \min(\gamma(P, J_1 + A, J_1)))$

$= \mu((I = 1, N), J_0, J_1)$

The resulting two recurrence functions now can be solved to obtain the upper
bound $J_0 + N * A$ and the lower bound $J_0$.
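
The bounds can be checked by brute force for a small trip count. The sketch below assumes a non-negative, loop-invariant increment `a` (an assumption the derivation above relies on when it picks $J_1 + A$ as the maximum) and enumerates every possible sequence of predicate outcomes; `final_j` and `bounds_hold` are hypothetical helper names.

```python
from itertools import product


def final_j(j0, a, preds):
    """Value of J after the loop when P takes the given value in each iteration."""
    j = j0
    for p in preds:
        if p:
            j += a
    return j


def bounds_hold(j0, a, n):
    """Check J0 <= J <= J0 + N * A for all 2**n predicate sequences (a >= 0)."""
    lower, upper = j0, j0 + n * a
    return all(lower <= final_j(j0, a, preds) <= upper
               for preds in product([False, True], repeat=n))
```

The two bounds are attained when P is always false and always true, respectively, which is why they correspond to the two recurrences obtained by selecting one $\gamma$ argument.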




274 Peng Tu and David Padua 



4.9 Index Array 

The use of array elements as subscripts makes dependence analysis and array 
privatization more difficult than when only scalars are used. When the values 
of an index array depend on the program’s input data, runtime analysis must 
be used. However, in many cases, the index arrays used in the program are 
assigned symbolic expressions. For instance, a wide range of applications use 
regular grids. Although the number of grids is input dependent, the structure 
of the grid is fixed. The structure is statically assigned to index arrays with 
symbolic expressions of input variables. In these cases, the value of an index 
array can be determined at compile-time. Consider the following segment of 
code: 



L:  DO J = 1, JMAX
      JPLUS(J) = J + 1
    ENDDO
    JPLUS(JMAX) = Q
U:  ...

It is possible to determine at compile-time that the array element $JPLUS(J)$
has the value of $J + 1$ for $J \in [1, JMAX - 1]$ and $Q$ for $J = JMAX$. We can
use the GSA representation to find out the value of $JPLUS(J)$ at statement
U. To this end, we use an extension of the SSA representation to include
arrays [CFR+91]. This extension is obtained by applying the following three
transformations:

1. Create a new array name for each array assignment; 

2. Use the subscript to identify which element is assigned; and 

3. Replace the assignment with an update function $\alpha(array, subscript, value)$.

For example, an assignment of the form $JPLUS(I) = exp$ will be converted
to $JPLUS_1 = \alpha(JPLUS_0, I, exp)$. The semantics of the $\alpha$ function is that
$JPLUS_1(I)$ receives the value of $exp$ while the other elements of $JPLUS_1$ will
take the values of the corresponding elements of $JPLUS_0$. This representation
maintains the single assignment property for array names. Hence, the def-use
chain is still maintained by the links associated with unique array names.
Using this extension, the last loop can be transformed into the following
GSA form:



L:  DO J = 1, JMAX
      $JPLUS_2 = \mu((J = 1, JMAX), JPLUS_0, JPLUS_1)$
      $JPLUS_1 = \alpha(JPLUS_2, J, J + 1)$
    ENDDO
    $JPLUS_3 = \alpha(JPLUS_2, JMAX, Q)$

To determine the value of an element $JPLUS_3(K)$ in $JPLUS_3$, we can use
backward substitution as follows:

$JPLUS_3(K) = \alpha(JPLUS_2, JMAX, Q)(K)$

$= \gamma(K = JMAX, Q, JPLUS_2(K))$

$= \gamma(K = JMAX, Q, \mu((J = 1, JMAX), JPLUS_0, JPLUS_1)(K))$

$= \gamma(K = JMAX, Q, \gamma(1 \le K < JMAX, JPLUS_1(K), JPLUS_0(K)))$

$= \gamma(K = JMAX, Q, \gamma(1 \le K < JMAX, K + 1, JPLUS_0(K)))$

In the preceding evaluation process, an expression of the form $\alpha(X, i, exp)(j)$
is evaluated to $\gamma(j = i, exp, X(j))$. An expression of the form $\mu(L, Y, Z)(j)$
is instantiated to a list of $\gamma$ functions that select different expressions for
different values of $j$. These evaluation rules are straightforwardly derived from
the definitions of the gating functions. To avoid unnecessary array renaming,
GSA conversion can be performed only on those arrays that are used as
subscripts.
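
The update function and the result of the backward substitution can be compared directly in a short sketch. Arrays are modeled as Python functions from a subscript to a value; `alpha`, `build_jplus3`, and `jplus3_closed` are hypothetical names, and the initial array `jplus0` stands for whatever values JPLUS held before the loop.

```python
def alpha(array, subscript, value):
    """Update function: element `subscript` becomes `value`, others are kept."""
    def updated(j):
        return value if j == subscript else array(j)
    return updated


def build_jplus3(jmax, q, jplus0):
    """Apply the array assignments of the example one after another."""
    current = jplus0
    for j in range(1, jmax + 1):
        current = alpha(current, j, j + 1)  # JPLUS_1 = alpha(JPLUS_2, J, J + 1)
    return alpha(current, jmax, q)          # JPLUS_3 = alpha(JPLUS_2, JMAX, Q)


def jplus3_closed(k, jmax, q, jplus0):
    """gamma(K = JMAX, Q, gamma(1 <= K < JMAX, K + 1, JPLUS_0(K)))."""
    if k == jmax:
        return q
    if 1 <= k < jmax:
        return k + 1
    return jplus0(k)


JMAX, Q = 6, 1
old_values = lambda k: -1  # placeholder contents of JPLUS_0
jplus3 = build_jplus3(JMAX, Q, old_values)
agree = all(jplus3(k) == jplus3_closed(k, JMAX, Q, old_values)
            for k in range(0, JMAX + 3))
```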



4.10 Conditional Data Flow Analysis 

Conditionally defined and used arrays can be privatized if the condition at
the use site subsumes the condition at the definition site. In this section, we
present a technique that uses gating functions to incorporate conditions into
a data flow analysis framework.

Traditional data flow analysis ignores the predicate that determines if a
conditional branch is taken. Therefore, it cannot propagate the conditional
data flow information. Consider the following example:



    DO I = 1, N
      IF (P) THEN
D1:     A(1:M) = ...
      ELSE
D2:     B(1:M) = ...
      ENDIF
U:    IF (P) THEN
U1:     ... = A(1:M)
      ELSE
U2:     ... = B(1:M)
      ENDIF
    ENDDO

In the data flow analysis, the must-reach definitions at the ends of statements
D1 and D2 are:

$MRD_{out}(D1) = \{A(1:M)\}$

$MRD_{out}(D2) = \{B(1:M)\}$

Hence, the must-reach definitions at the top of join node U are obtained by:

$MRD_{in}(U) = MRD_{out}(D1) \cap MRD_{out}(D2)$

$= \{A(1:M)\} \cap \{B(1:M)\}$

$= \emptyset$

Because the conservative intersection operator $\cap$ does not contain the
conditional flow information, the conditional reaching definition is lost at the join
node. To solve the problem of the conservative intersection operator and to
propagate the conditional reaching definitions, we will use the gating function
at the join node U as the intersection operator $\cap_G$. That is,

$MRD_{in}(U) = MRD_{out}(D1) \cap_G MRD_{out}(D2)$

$= \gamma(P, MRD_{out}(D1), MRD_{out}(D2))$

$= \{\gamma(P, A(1:M), B(1:M))\}$

The conditional reaching definitions represented by the gating expression 
then can be propagated in the control flow graph. At use sites U1 and U2, the 
control dependence predicate P can be used to prove that the use conditions 
always subsume the definition conditions. 
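
The difference between the conservative and the gated intersection can be sketched as follows. This is a simplified model, not the Polaris data structure: definitions are plain strings, a gated set is a tuple tagged with its predicate, and `reaches` is a hypothetical query that asks whether a definition must reach a use guarded by a given predicate value.

```python
def plain_intersection(defs_true, defs_false):
    """Conservative join: keep only definitions present on both paths."""
    return set(defs_true) & set(defs_false)


def gated_intersection(pred, defs_true, defs_false):
    """Gated join: keep both sets, tagged by the predicate that selects them."""
    return ("gamma", pred, frozenset(defs_true), frozenset(defs_false))


def reaches(mrd, definition, pred, value):
    """Does `definition` must-reach a use executed only when pred == value?"""
    _, gate_pred, defs_true, defs_false = mrd
    if gate_pred == pred:
        return definition in (defs_true if value else defs_false)
    # Unrelated predicate: fall back to the conservative answer.
    return definition in defs_true and definition in defs_false


mrd_d1 = {"A(1:M)"}
mrd_d2 = {"B(1:M)"}
lost = plain_intersection(mrd_d1, mrd_d2)       # nothing survives the join
kept = gated_intersection("P", mrd_d1, mrd_d2)  # both sets survive, gated by P
```

With the gated set, the use at U1 (executed when P is true) sees the definition of A, and the use at U2 (executed when P is false) sees the definition of B.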



4.11 Implementation and Experiments 

The techniques for predicated conditional value analysis, bounds analysis 
of monotonic variables, index array analysis, and the conditional data flow 
analysis using gated intersection have been implemented in the Polaris par- 
allelizing compiler. Currently, they serve to provide symbolic information for 
conditional array privatization. 

Table 4.2 shows the improvement we obtained using the symbolic analysis 
for conditional array privatization. The columns in the table show the number 
of privatizable arrays identified with and without symbolic analysis. Of the six 
programs tested, four show an increased number of privatizable arrays. Also 
shown are: the increases in parallel loop coverage expressed as a percentage 
of the total execution time; the speedups obtained on an 8 processor SGI 
Challenge; and the techniques applied in each program. 



Program  Pri.      Pri.       Increased   Speedups  Cond.  Bounds      Index
         (w/o SA)  (with SA)  Coverage %  (8 proc)  Value  Recurrence  Array

ARC2D     0         2         15.3        4.5       X      X           X
EDNA     16        19         33.6        4.5       X      X           X
FLO52     4         4          0.0        2.6
MDG      18        19         97.7        4.9       X                  X
OCEAN     4         7         42.7        1.2       X
TRFD      4         4          0.0        2.7









Table 4.2. Effect of Symbolic Analysis on Array Privatization

The experiment shows that a small increase in the number of private
arrays can make a big difference in the parallelism discovered. In MDG,
privatization with symbolic analysis privatizes only one more array than
privatization without it, yet this makes parallel a loop that accounts for 97 percent
of the program's execution time. Of the six programs, four
require demand-driven symbolic analysis techniques to parallelize their
important loops. These loops account for 15 to 97 percent of the execution time
of the programs. Sequential execution of these loops would prevent any
meaningful speedup. These results show that advanced symbolic analysis techniques
are very important for parallelizing these applications.

Among the analysis techniques, the predicated conditional value is used 
most often, followed by statically assigned symbolic index array analysis. The 
bounded recurrence column shows only the cases where there is no closed 
form for the recurrence due to conditional increment. Induction variables 
with closed forms occur in all these programs. Currently, Polaris uses other 
techniques for induction variable substitution. The effect of induction variable 
substitution on array privatization is not shown here. 

The demand-driven symbolic analysis part of the privatizer requires less 
than 5 percent of the total execution time of the privatizer, primarily be- 
cause symbolic analysis is infrequently required. The small increase in time 
is justified by the additional parallel loops we can identify using the advanced 
techniques. A command-line parameter in Polaris limits the number of
nested $\gamma$ functions allowed in any predicated expression.

The array privatizer in Polaris has been converted to work directly on 
the GSA representation. Using the GSA representation of use-def chains, the 
privatizer shows an average speedup of 6 when compared with the original 
implementation based on the control flow graph. 



5. Related Work 

Array privatization was first identified as one of the most important transfor- 
mations for parallelizing large applications in [EHLP91]. This is not surpris- 
ing, because the usual programming practice in conventional languages is to 
reuse memory. Scalar expansion has long been used in vectorizing compilers 
to remove scalar reuse induced dependences. When the granularity of par- 
allelism is increased to the outer loops, memory-related dependences caused 
by array reuse become an important obstacle to parallelization. 

Previous work on eliminating memory-related dependence focused on 
scalar expansion [Wol82], scalar privatization [BCFH89], scalar renaming 
[CF87], and array expansion [PW86] [Fea88]. Recent work in array-induced
memory-related dependence includes array privatization based on
dependence analysis [MAL92], and array privatization based on array data flow
analysis [TP93a, TP93b, Li92].

In symbolic analysis, some of the techniques designed in the past are 
based on path values (i.e., the set of symbolic values of a variable on all
possible paths reaching a program point). Corresponding to each path there is
a path condition (i.e., a boolean expression that is true if the path is exe- 
cuted) [CR85, CHT79]. These techniques have exponential time and space 
requirements that limit their applicability in practical situations. 

The GSA representation of symbolic expressions in this chapter is a new
way to represent predicated path values. With GSA, conditions and values
are represented in a compact way. Expressions in GSA can be manipulated
like normal arithmetic expressions except that special treatment is needed for 
gating functions. We also use the iterative control dependences to represent 
path conditions. 

Symbolic analysis is also related to the problem of automatic proof of in- 
variant assertions in programs. Symbolic execution can be considered as ab- 
stract interpretation [CH92]. In [CH78], abstract interpretation is used to dis- 
cover the linear relationships between variables. It can be used to propagate 
symbolic linear expressions as possible values for symbolic variables. How- 
ever, the predicate guarding the conditional possible values is not included in 
the representation and cannot be propagated. The Parafrase-2 compiler uses
a similar approach to evaluate symbolic expressions. It uses a join function 
to compute the intersection of possible values at the confluence nodes of the 
flow graph to cut down the amount of maintained information. Again, the 
predicated conditional information is lost in the join function. Abstract in- 
terpretation also has been used in [AH90] for induction variable substitution. 

The demand-driven approach in this chapter is a special case of the more 
general approach of Subgoal Induction [MW77]. The internal representation 
chosen facilitates the demand-driven approach. (1) In the SSA representation, 
the variable name is unique and the use-def chain is embedded in the unique 
variable name. (2) In GSA, backward substitution can be done by following 
the use-def chain in the name without going through the flow graph. Using the 
GSA sparse representation and the demand-driven approach, our approach 
can perform more aggressive analysis only on demand. 

The SSA form has been used to determine the equivalence of symbolic 
variables and to construct the global value graph in a program [AWZ88] [RWZ88].
Nascent [Wol92] uses SSA to do a comprehensive analysis of recurrences. Its 
internal representation does not include the gating predicate. The SSA repre- 
sentation in Nascent is achieved through an explicit representation of use-def 
chains, which they call a demand-driven form of SSA. Their approach for con- 
structing strongly connected regions for recurrences in the graph is similar 
to our backward substitution of names. GSA also has been used in ParaScope
to build the global value graph [Hav93]. Our demand-driven backward
substitution in GSA and our use of control dependences to project values are
unique.

Currently, the most used algorithm for building SSA form is given in 
[CFR+91]. GSA was introduced in [BMO90] as a part of the Program Depen- 
dence Web (PDW), which is an extension of the Program Dependence Graph 
(PDG) [FOW87]. An algorithm for constructing GSA from PDG and SSA 
is given in [BMO90]. Havlak [Hav93] develops another algorithm for build- 
ing GSA from a program control flow graph and SSA. Recently, the authors 
developed an algorithm to efficiently construct the GSA directly from the 
control flow graph [TP94]. Control dependence was introduced in [FOW87] 
as part of PDG. The SSA construction paper [CFR+91] presents an algorithm 
for building control dependences.

Summary. This paper presents a theoretical framework for automatically
partitioning parallel loops and data arrays for cache-coherent NUMA multiprocessors
to minimize both cache coherency traffic and remote memory references. While 
several previous papers have looked at hyperplane partitioning of iteration spaces 
to reduce communication traffic, the problem of deriving the optimal tiling pa- 
rameters for minimal communication in loops with general affine index expressions 
has remained open. Our paper solves this open problem by presenting a method 
for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal 
communication in multiprocessors with caches. Our framework uses matrices to 
represent iteration and data space mappings and the notion of uniformly intersect- 
ing references to capture temporal locality in array references. We introduce the 
notion of data footprints to estimate the communication traffic between proces- 
sors and use linear algebraic methods and lattice theory to compute precisely the 
size of data footprints. We show that the same theoretical framework can also be 
used to determine optimal tiling parameters for both data and loop partitioning 
in distributed memory multicomputers. We also present a heuristic for combined 
partitioning of loops and data arrays to maximize the probability that references 
hit in the cache, and to maximize the probability cache misses are satisfied by the 
local memory. We have implemented this framework in a compiler for Alewife, a 
distributed shared memory multiprocessor. 



1. Introduction 

Cache-based multiprocessors are attractive because they seem to allow the 
programmer to ignore the issues of data partitioning and placement. Because 
caches dynamically copy data close to where it is needed, repeat references 
to the same piece of data do not require communication over the network,
and hence reduce the need for careful data layout. However, the performance 
of cache-coherent systems is heavily predicated on the degree of temporal 
locality in the access patterns of the processor. Loop partitioning for
cache-coherent multiprocessors is an effort to increase the percentage of references
that hit in the cache.

The degree of reuse of data, or conversely, the volume of communication of 
data, depends both on the algorithm and on the partitioning of work among 
the processors. (In fact, partitioning of the computation is often considered 
to be a facet of an algorithm.) For example, it is well known that a matrix
multiply computation distributed to the processors by square blocks has a
much higher degree of reuse than the matrix multiply distributed by rows or 
columns. 
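
The matrix multiply claim can be made concrete with a small count. The sketch below assumes an n x n product C = A * B on p processors and counts only the distinct elements of A and B that one processor reads; the function names are hypothetical. Each processor performs the same n^3 / p multiply-adds under either partition, so a smaller footprint means more reuse per element fetched.

```python
import math


def footprint_by_rows(n, p):
    """Elements of A and B read by a processor that computes n/p full rows of C."""
    assert n % p == 0
    return (n // p) * n + n * n


def footprint_by_blocks(n, p):
    """Elements of A and B read by a processor that computes one square block of C."""
    s = math.isqrt(p)
    assert s * s == p and n % s == 0
    b = n // s
    return 2 * b * n


def brute_force_footprint(rows, cols, n):
    """Enumerate the reads of C[i, j] += A[i, k] * B[k, j] for a set of C elements."""
    touched = set()
    for i in rows:
        for j in cols:
            for k in range(n):
                touched.add(("A", i, k))
                touched.add(("B", k, j))
    return len(touched)


rows_64 = footprint_by_rows(64, 16)      # 4 * 64 + 64 * 64
blocks_64 = footprint_by_blocks(64, 16)  # 2 * 16 * 64
```

For n = 64 and p = 16 the row partition reads 4352 elements per processor while the square-block partition reads 2048.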

Loop partitioning can be done by the programmer, by the run time sys- 
tem, or by the compiler. Relegating the partitioning task to the programmer 
defeats the central purpose of building cache-coherent shared-memory sys- 
tems. While partitioning can be done at run time (for example, see [1,2]), it 
is hard for the run time system to optimize for cache locality because much 
of the information required to compute communication patterns is either un- 
available at run time or expensive to obtain. Thus compile-time partitioning 
of parallel loops is important. 

This paper focuses on the following problem in the context of cache- 
coherent multiprocessors. Given a program consisting of parallel do loops 
(of the form shown in Fig. 2.1 in Sect. 2.1), how do we derive the optimal 
tile shapes of the iteration-space partitions to minimize the communication 
traffic between processors? We also indicate how our framework can be used
for loop and data partitioning for distributed memory machines, both with 
and without caches. 

1.1 Contributions and Related Work 

This paper develops a unified theoretical framework that can be used for loop 
partitioning in cache-coherent multiprocessors, for loop and data partitioning 
in multicomputers with local memory, and for loop and data partitioning in 
cache coherent NUMA multiprocessors. The central contribution of this paper 
is a method for deriving an optimal hyperparallelepiped tiling of iteration 
spaces to minimize communication. The tiling specifies both the shape and 
size of iteration space tiles. Our framework allows the partitioning of doall 
loops accessing multiple arrays, where the index expressions in array accesses 
can be any affine function of the indices. 

Our analysis uses the notion of uniformly intersecting references to cate- 
gorize the references within a loop into classes that will yield cache locality. 
This notion helps specify precisely the set of references that have substantially 
overlapping data sets. Overlap produces temporal locality in cache accesses. 
A similar concept of uniformly generated references has been used in earlier 
work in the context of reuse and iteration space tiling [3,4]. 

The notion of data footprints is introduced to capture the combined set 
of data accesses made by references within each uniformly intersecting class. 
(The term footprint was originally coined by Stone and Thiebaut [5].) Then, 
an algorithm to compute precisely the total size of the data footprint for 
a given loop partition is presented. Precisely computing the size of the set 
of data elements accessed by a loop tile was itself an important and open 
problem. While general optimization methods can be applied to minimize the 
size of the data footprint and derive the corresponding loop partitions, we 
demonstrate several important special cases where the optimization problem
is very simple. The size of data footprints can also be used to guide program
transformations to achieve better cache performance in uniprocessors.
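
To give a feel for what the size of a data footprint measures, the sketch below counts it by brute-force enumeration for a rectangular tile. This is not the algorithm developed in the paper, which computes the size analytically; the helper `footprint_size` and the three example references are hypothetical.

```python
from itertools import product


def footprint_size(tile_shape, references):
    """Count the distinct array elements touched by one rectangular tile.

    tile_shape: extent of the tile along each loop index, e.g. (4, 4).
    references: functions mapping an iteration tuple to an element tuple.
    """
    touched = set()
    for iteration in product(*(range(extent) for extent in tile_shape)):
        for ref in references:
            touched.add(ref(iteration))
    return len(touched)


# Hypothetical loop body that reads A[i, j], A[i + 1, j] and A[i, j + 1].
REFS = [
    lambda it: (it[0], it[1]),
    lambda it: (it[0] + 1, it[1]),
    lambda it: (it[0], it[1] + 1),
]

square = footprint_size((4, 4), REFS)   # 16 iterations
strip = footprint_size((1, 16), REFS)   # also 16 iterations
```

Both tiles contain 16 iterations, but the square tile touches 24 elements and the strip touches 33, which is the kind of difference the choice of tile shape is meant to exploit.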

Although there have been several papers on hyperplane partitioning of 
iteration spaces, the problem of deriving the optimal hyperparallelepiped tile 
parameters for general affine index expressions has remained open. For ex- 
ample, Irigoin and Triolet [6] introduce the notion of loop partitioning with 
multiple hyperplanes which results in hyperparallelepiped tiles. The purpose 
of tiling in their case is to provide parallelism across tiles, and vector process- 
ing and data locality within a tile. They propose a set of basic constraints 
that should be met by any partitioning and derive the conditions under which 
the hyperplane partitioning satisfies these constraints. 

Although their paper describes useful properties of hyperplane partition- 
ing, it does not address the issue of automatically generating the tile param- 
eters. Careful analysis of the mapping from the iteration space to the data 
space is very important in automating the partitioning process. Our paper 
describes an algorithm for automatically computing the partition based on 
the notion of cumulative footprints, derived from the mapping from iteration 
space to data space. 

Abraham and Hudak [7] considered loop partitioning in multiprocessors 
with caches. However, they dealt only with index expressions of the form 
index variable plus a constant. They assumed that the array dimension was 
equal to the loop nesting and focused on rectangular and hexagonal tiles. 
Furthermore, the code body was restricted to an update of A[i,j].

Our framework, however, does not place these restrictions on the code 
body. It is able to handle much more general index expressions, and pro- 
duce parallelogram partitions if desired. We also show that when Abraham 
and Hudak’s methods can be applied to a given loop nest, our theoretical 
framework reproduces their results. 

Ramanujam and Sadayappan [8] deal with data partitioning in multicom- 
puters with local memory and use a matrix formulation; their results do not 
apply to multiprocessors with caches. Their theory produces communication- 
free hyperplane partitions for loops with affine index expressions when such 
partitions exist. However, when communication-free partitions do not exist, 
they can deal only with index expressions of the form variable plus a
constant offset. They further require the array dimension to be equal to the loop
nesting.

In contrast, our framework is able to discover optimal partitions in cases 
where communication free partitions are not possible, and we do not restrict 
the loop nesting to be equal to array dimension. In addition, we show that 
our framework correctly produces partitions identical to those of Ramanujam 
and Sadayappan when communication free partitions do exist. 

In a recent paper, Anderson and Lam [9] derive communication-free
partitions for multicomputers when such partitions exist, and block loops into
squares otherwise. Our notion of cumulative footprints allows us to derive
optimal partitions even when communication-free partitions do not exist.

Gupta and Banerjee [10] address the problem of automatic data parti- 
tioning by analyzing the entire program. Although our paper deals with loop 
and data partitioning for a single loop only, the following differences in the 
machine model and the program model lead to problems that are not ad- 
dressed by Gupta and Banerjee: (1) The data distributions considered by 
them do not include general hyperparallelepipeds. In order to deal with hy- 
perparallelepipeds, one requires the analysis of communication presented in 
our paper. (2) Their communication model does not take into account caches. 
(3) They deal with simple index expressions of the form $c_1 i + c_2$ and not a
general affine function of the loop indices. 

Our work complements the work of Wolf and Lam [3] and Schreiber and 
Dongarra [11]. Wolf and Lam derive loop transformations (and tile the
iteration space) to improve data locality in multiprocessors with caches. They use
matrices to model transformations and use the notion of equivalence classes 
within the set of uniformly generated references to identify valid loop trans- 
formations to improve the degree of temporal and spatial locality within 
a given loop nest. Schreiber and Dongarra briefly address the problem of 
deriving optimal hyperparallelepiped iteration space tiles to minimize
communication traffic (they refer to it as I/O requirements). However, their work
differs from this paper in the following ways: (1) Their machine model does
not have a processor cache. (2) The data space corresponding to an array
reference and the iteration space are isomorphic. These restrictions make the
problem of computing the communication traffic much simpler. Also, one of
the main issues addressed by Schreiber and Dongarra is the atomicity
requirement of the tiles, which is related to the dependence vectors. This paper
is not concerned with those requirements because it assumes that the iterations
can be executed in parallel.

Ferrante, Sarkar, and Thrash [12] address the problem of estimating the 
number of cache misses for a nest of loops. This problem is similar to our 
problem of finding the size of the cumulative footprint, but differs in these 
ways: (1) We consider a tile in the iteration space and not the entire iteration 
space; our tiles can be hyperparallelepipeds in general. (2) We partition the 
references into uniformly intersecting sets, which makes the problem compu- 
tationally more tractable, since it allows us to deal with only the tile at the 
origin. (3) Our treatment of coupled subscripts is much simpler, since we look 
at maximal independent columns, as shown in Sect. 5.2. 

1.2 Overview of the Paper 

The rest of this paper is structured as follows. In Sect. 2 we state our
system model and our program-level assumptions. In Sect. 3 we first present a
few examples to illustrate the basic ideas behind loop partitioning; we then
discuss the notion of data partitioning, and when it is important. In Sect. 4
we develop the theoretical framework for partitioning and present several
additional examples. In Sect. 5 we extend the basic framework to handle
more general expressions; Sections 6 and 7 indicate extensions to the basic
framework to handle data partitioning and more general types of systems.
The framework for both loop and data partitioning has been implemented 
in the compiler system for the Alewife multiprocessor. The implementation 
of our compiler system and a sampling of results is presented in Sect. 8, and 
Sect. 9 concludes the paper. 



2. Problem Domain and Assumptions 

This paper focuses on the problem of partitioning loops in cache-coherent 
shared-memory multiprocessors. Partitioning involves deciding which loop
iterations will run collectively in a thread of computation. Computing loop 
partitions involves finding the set of iterations which when run in parallel 
minimizes the volume of communication generated in the system. This section 
describes the types of programs currently handled by our framework and the 
structure of the system assumed by our analysis. 

2.1 Program Assumptions 

Fig. 2.1 shows the structure of the most general single loop nest that we
consider in this paper. The statements in the loop body have array references
of the form $A[g(i_1, i_2, \ldots, i_l)]$, where the index function is $g : Z^l \to Z^d$, $l$ is
the loop nesting and $d$ is the dimension of the array $A$. We have restricted our
attention to doall loops since we want to focus on the relation between the 
iteration space and the data space and factor out issues such as dependencies 
and synchronization that arise from the ordering of the iterations of a loop. 
We believe that the framework described in this paper can be applied with 
suitable modifications for loops in which the iterations are ordered. 

We assume that all array references within the loop body are uncondi- 
tional. One of the two following approaches may be taken for loops with 
conditionals. 

— Assume that all array references are actually accessed, ignoring the condi- 
tions surrounding a reference. 

— Include only references within conditions that are likely to be true based 
on profiling information. 

We address the problem of loop and data partitioning for index expressions
that are affine functions of loop indices. In other words, the index function
can be expressed as,

$$g(i) = iG + a \qquad (2.1)$$


Doall (i1=l1:u1, i2=l2:u2, ..., il=ll:ul)
    loop body
EndDoall



Fig. 2.1. Structure of a single loop nest 



where G is an $l \times d$ matrix with integer entries and a is an integer constant
vector of length d, termed the offset vector. Note that i, g(i), and a are 
row vectors. We often refer to an array reference by the pair (G,a). (An 
example of this function is presented in Sect. 3). Similar notation has been 
used in several papers in the past, for example, see [3,4]. All our vectors 
and matrices have integer entries unless stated otherwise. We assume that 
the loop bounds are such that the iteration space is rectangular. The prob- 
lem with non-rectangular tiles is one of load balancing (due to boundary 
effects in tiling) and this can be handled by optimizing for a machine with 
a large number of virtual processors and mapping the virtual processors to 
real processors in a cyclic fashion. 

Loop indices are assumed to take all integer values between their lower
and upper bounds, i.e., the strides are one.

Previous work [7,8,13] in this area restricted the arrays in the loop body to 
be of dimension exactly equal to the loop nesting. Abraham and Hudak [7] 
further restrict the loop body to contain only references to a single array; 
furthermore, all references are restricted to be of the form
$A[i_1 + a_1, i_2 + a_2, \ldots, i_d + a_d]$, where $a_j$ is an integer constant. Matrix multiplication is a
simple example that does not fit these restrictions. 

Given P processors, the problem of loop partitioning is to divide the 
iteration space into P tiles such that the total communication traffic on the 
network is minimized with the additional constraint that the tiles are of 
equal size, except at the boundaries of the iteration space. The constraint 
of equal size partitions is imposed to achieve load balancing. We restrict 
our discussions to hyperparallelepiped tiles, of which rectangular tiles are a 
special case. 

Like [7, 8, 13], we do not include the effects of synchronization in our 
framework. Synchronization is handled separately to ensure correct behavior. 
For example, in the doall loop in Fig. 2.1, one might introduce a barrier 
synchronization after the loop nest if so desired. We also note that in many 
cases fine-grain data-level synchronization can be used within a parallel do 
loop to enforce data dependencies and its cost approximately modeled as 
slightly more expensive communication than usual [14]. See Appendix B for 
some details. 

2.2 System Model

Our analysis applies to systems whose structure is similar to that shown 
in Fig. 2.2. The system comprises a set of processors, each with a coherent 
cache. Cache misses are satisfied by global memory accessed over an inter- 
connection network or a bus. The memory can be implemented as a single 
monolithic module (as is commonly done in bus-based multiprocessors), or in
a distributed fashion as shown in the figure. The memory modules might also 
be implemented on the processing nodes themselves (data partitioning for lo- 
cality makes sense only for this case). In all cases, our analysis assumes that 
the cost of a main memory access is much higher than a cache access, and 
for loop partitioning, our analysis assumes that the cost of the main memory 
access is the same no matter where in main memory the data is located. 







Fig. 2.2. A system with caches and uniform-access main memory (UMA). 



The goal of loop partitioning is to minimize the total number of main 
memory accesses. For simplicity, we assume that the caches are large enough 
to hold all the data required by a loop partition, and that there are no 
conflicts in the caches. Techniques such as sub-blocking described in [15] or 
other techniques as in [16] and in [17] can be applied to reduce effects due to 
conflicts. When caches are small, the optimal loop partition does not change, 
rather, the size of each loop tile executed at any given time on the processor 
must be adjusted [15] so that the data fits in the cache (if we assume that 
the cache is effectively flushed between executions of each loop tile). Unless 
otherwise stated, we assume that cache lines are of unit length. The effect of 
larger cache lines can be included easily as suggested in [7], and is discussed 
further in Sect. 6.2. 

If a program has multiple loops, then loop tiling parameters can be chosen 
independently for each loop to optimize cache performance by applying the 
techniques described in this paper. We assume there is no data reuse in the 
cache across loops. In programs with multiple loops and data arrays, tiling
parameters for each loop and data array cannot be chosen independently in 
systems where the memories are local to the processors (see Fig. 3.3). This 
issue is discussed further in Sect. 7. 



3. Loop Partitions and Data Partitions 

This section presents examples to introduce and illustrate some of our defini- 
tions and to motivate the benefits of optimizing the shapes of loop and data 
tiles. More precise definitions are presented in the next section. 

As mentioned previously, we deal with index expressions that are affine 
functions of loop indices. In other words, the index function can be expressed 
as in Equation 2.1. Consider the following example to illustrate the above 
expression of index functions. 

Example 31. The reference $A[i_3 + 2, 5, i_2 - 1, 4]$ in a triply nested loop can
be expressed by

$$(i_1, i_2, i_3) \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} + (2, 5, -1, 4)$$

In this example, the second and fourth column of G are zero indicating 
that the second and fourth subscripts of the reference are independent of the 
loop indexes. In such cases, we show in Sect. 5.2 that we can ignore those 
columns and treat the referenced array as an array of lower dimension. In
what follows, without loss of generality, we assume that the G matrix contains no
zero columns.
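
The index function of Example 31 can be written out as a minimal Python sketch. The helper names `index_function` and `zero_columns` are illustrative assumptions, not part of the original framework; vectors are treated as rows, following the convention stated in Sect. 2.1.

```python
# Affine index function g(i) = i*G + a for the reference A[i3+2, 5, i2-1, 4].
# Vectors are rows, as in the text; all helper names are illustrative.

G = [
    [0, 0, 0, 0],  # contribution of i1
    [0, 0, 1, 0],  # contribution of i2
    [1, 0, 0, 0],  # contribution of i3
]
a = [2, 5, -1, 4]  # offset vector


def index_function(i, G, a):
    """Return the subscript vector g(i) = i*G + a."""
    return [sum(i[r] * G[r][c] for r in range(len(i))) + a[c]
            for c in range(len(a))]


def zero_columns(G):
    """Return the 0-based columns of G that depend on no loop index."""
    return [c for c in range(len(G[0])) if all(row[c] == 0 for row in G)]


print(index_function([7, 8, 9], G, a))  # subscripts for (i1, i2, i3) = (7, 8, 9)
print(zero_columns(G))                  # columns that can be dropped
```

The zero columns reported here are the second and fourth subscripts (0-based indices 1 and 3), which are the ones the text proposes to ignore.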

Now, let us introduce the concept of a loop partition by examining the 
following example. Loop partitioning specifies the tiling parameters of the 
iteration space. Loop partitioning is sometimes termed iteration space parti- 
tioning or tiling. 



Example 32. 

Doall (i=101:200, j=1:100)

    A[i,j] = B[i+j,i-j-1] + B[i+j+4,i-j+3]

EndDoall 

Let us assume that we have 100 processors and we want to distribute 
the work among them. There are 10,000 points in the iteration space and so 
one can allocate 100 of these to each of the processors to distribute the load 
uniformly. Fig. 3.1 shows two simple ways of partitioning the iteration space 
- by rows and by square blocks - into 100 equal tiles. 

Fig. 3.1. Two simple rectangular loop partitions in the iteration space. 



Minimizing communication volume requires that we minimize the number 
of data elements accessed by each loop tile. To facilitate this optimization, 
we introduce the notion of a data footprint. Footprints comprise the data 
elements referenced within a loop tile. In other words, the footprints are 
regions of the data space accessed by a loop tile. In particular, the footprint 
with respect to a specific reference in a loop tile gives us all the data elements 
accessed through that reference from within a tile of a loop partition. 

Using Fig. 3.2, let us illustrate the footprints corresponding to a reference 
of the form B [i+j , i-j-1] for the two loop partitions shown in Fig. 3.1. The 
footprints in the data space resulting from the loop partition a are diagonal 
stripes and those resulting from partition b are square blocks rotated by 
45 degrees. Algorithms for deriving the footprints are presented in the next 
section. 

Let us compare the two loop partitions in the context of a system with
caches and uniform-access memory (see Fig. 2.2) by computing the number of
cache misses. The number of cache misses is equal to the number of distinct
elements of B accessed by a loop tile, which is equal to the size of a loop tile’s
footprint on the array B. (Sect. 6.1 deals with minimizing cache-coherence
traffic.) Caches automatically fetch a loop tile’s data footprint as the loop
tile executes. For each tile in partition a, the number of cache misses can be 
shown to be 104 (see Sect. 5.1) whereas the number of cache misses in each 
tile of partition b can be shown to be 140. Thus, because it allows data reuse, 
loop partition a is a better choice if our goal is to minimize the number of 
cache misses, a fact that is not obvious from the source code. 
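
The two miss counts can be checked by brute-force enumeration. The sketch below is only an illustration: it assumes that each strip of partition a holds a single value of j together with all 100 values of i (the orientation that reproduces the count of 104), and that each block of partition b is a 10 x 10 square. The function name `footprint` is hypothetical.

```python
def footprint(tile):
    """Distinct elements of B touched by B[i+j, i-j-1] and B[i+j+4, i-j+3]."""
    touched = set()
    for i, j in tile:
        touched.add((i + j, i - j - 1))
        touched.add((i + j + 4, i - j + 3))
    return touched


# Assumed strip orientation: one value of j, all 100 values of i.
strip_tile = [(i, 1) for i in range(101, 201)]
# Assumed block: 10 consecutive values of i and of j.
block_tile = [(i, j) for i in range(101, 111) for j in range(1, 11)]
# The opposite strip orientation, for comparison: one i, all j.
other_strip = [(101, j) for j in range(1, 101)]

misses_a = len(footprint(strip_tile))
misses_b = len(footprint(block_tile))
print(misses_a, misses_b, len(footprint(other_strip)))
```

The second reference at iteration (i, j) touches the same element as the first reference at (i + 4, j), so reuse arises only between iterations that differ in i. A strip that varies i captures that reuse, whereas a strip that varies only j captures none of it.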

When is data partitioning important? Data partitioning is the problem of
partitioning the data arrays into data tiles and assigning each data tile to a
local memory module, such that the number of memory references that can be
satisfied by the local memory is maximized. Data partitioning is relevant only
for nonuniform memory-access (NUMA) systems (for example, see Fig. 3.3).

Fig. 3.2. Data footprints in the data space resulting from loop partitions a
and b








Fig. 3.3. Systems with nonuniform main-memory access time, (a) A cache- 
coherent NUMA system, (b) A distributed memory multicomputer. 



In systems with nonuniform memory-access times, both loop and data 
partitioning are required. Our analysis applies to such systems as well. The 
loop tiles are assigned to the processing nodes and the data tiles to memory 
modules associated with the processing nodes so that a maximum number of 
the data references made by the loop tiles are satisfied by the local memory 
module. Note that in systems with nonuniform memory-access times, but 
which have caches, data partitioning may still be performed to maximize the 
number of cache misses that can be satisfied by the memory module local
to the processing node. 

Referring to Fig. 3.2, the footprint size is minimized by choosing a diagonal
striping of the data space as the data partition, and a corresponding
horizontal striping of the iteration space as the loop partition. The addi- 
tional step of aligning corresponding loop and data tiles on the same node 
maximizes the number of local memory references. 

In fact, the above horizontal partitioning of the loop space and diagonal 
striping of the data space results in zero communication traffic. Ramanujam 
and Sadayappan [8] presented algorithms to derive such communication-free 
partitions when possible. On the other hand, in addition to producing the 
same partitions when communication-traffic-free partitions exist (see Sec- 
tions 5.1 and 6), our analysis will discover partitions that minimize traffic 
when such partitions are non-existent as well (see Example 45). 



Example 33. 

Doall (i=1:N, j=1:N)

    A[i,j] = B[i,j] + B[i+1,j-2] + B[i-1,j+1]

EndDoall 

For the loop shown in Example 33, a parallelogram partition results in 
a lower cost of memory access compared to any rectangular partition since 
most of the inter-iteration communication can be internalized to within a
processor for a parallelogram partition (see Sect. 8.1). Because rectangular 
partitions often do not minimize communication, we would like to include 
parallelograms in the formulation of the general loop partitioning problem. 
In higher dimensions a parallelogram tile generalizes to a hyperparallelepiped; 
the next section defines it precisely. 



4. A Framework for Loop and Data Partitioning 

This section first defines precisely the notion of a loop partition and the no- 
tion of a footprint of a loop partition with respect to a data reference in 
the loop. We prove a theorem showing that the number of integer points 
within a tile is equal to the volume of the tile, which allows us to use vol- 
ume estimates in deriving the amount of communication. We then present 
the concept of uniformly intersecting references and a method of comput- 
ing the cumulative footprint for a set of uniformly intersecting references. 
We develop a formalism for computing the volume of communication on the 
interconnection network of a multiprocessor for a given loop partition, and 
show how loop tiles can be chosen to minimize this traffic. We briefly indicate 
how the cumulative footprint can be used to derive optimal data partitions 
for multicomputers with local memory (NUMA machines). 







4.1 Loop Tiles in the Iteration Space 

Loop partitioning results in a tiling of the iteration space. We consider only 
hyperparallelepiped partitions in this paper; rectangular partitions are special 
cases of these. Furthermore, we focus on loop partitioning where the tiles are 
homogeneous except at the boundaries of the iteration space. Under these 
conditions of homogeneous tiling, the partitioning is completely defined by 
specifying the tile at the origin, as indicated in Fig. 4.1. Under homogeneous 
tiling, the concept of the tile at the origin is similar to the notion of the 
clustering basis in [6]. (See Appendix A for a more general representation of 
hyperparallelepiped loop tiles based on bounding hyperplanes.) 







Fig. 4.1. Iteration space partitioning is completely specified by the tile at 
the origin. 



Definition 41. An $l$ dimensional square integer matrix L defines a semi
open hyperparallelepiped tile at the origin of an $l$ dimensional iteration space
as follows. The set of iteration points included in the tile is

$$\{ x \mid x = \sum_{i=1}^{l} \alpha_i L_i, \ 0 \le \alpha_i < 1 \}$$

where $L_i$ is the ith row of L. As depicted in Fig. 4.1, the rows of the matrix
L specify the vertices of the tile at the origin. Often, we also refer to the
partition by the L matrix since each of the other tiles is a translation of the
tile at the origin.

Example 41. A rectangular partition can be represented by a diagonal L
matrix. Consider a three dimensional iteration space $I \times J \times K$ partitioned into
rectangular tiles where each tile is of the form $\{(i_0, j, k_0) \mid 0 \le j < J\}$. In
other words, constants $i_0$ and $k_0$ specify the tile completely. Such a partition
is represented by

$$L = \begin{pmatrix} 1 & 0 & 0 \\ 0 & J & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Definition 42. A general tile in the iteration space is a translation of the
tile at the origin. The translation vector is given by

$$\sum_{i=1}^{l} \lambda_i L_i$$

where $\lambda_i$ is an integer. A tile is completely specified by $(\lambda_1, \ldots, \lambda_l)$. For
example, $(0, \ldots, 0)$ specifies the tile at the origin.

The rest of this paper deals with optimizing the shape of the tile at the 
origin for minimal communication. Because the amount of communication is 
related to the number of integer points within a tile, we begin by proving 
the following theorem relating the volume of a tile to the number of inte- 
ger points within it. This theorem on lattices allows us to use volumes of 
hyperparallelepipeds derived using determinants to determine the amount of 
communication. 

Theorem 41. The number of integer points (iteration points) in tile L is
equal to the volume of the tile, which is given by $|\det L|$.

Proof:. We provide a sketch of the proof; a more detailed proof is given in [18]. 

It is easy to show that the theorem is true for an n-dimensional semi-open
rectangle. For a given n-dimensional semi-open hyperparallelepiped, let its
volume be $V$ and let $P$ be the number of integer points in it. For any positive
integer $R$, it can be shown that one can pack $R^n$ of these hyperparallelepipeds
into an n-dimensional rectangle of volume $V_R$ and number of integer points
$P_R$, such that both $V_R - R^n V$ and $P_R - R^n P$ grow slower than $R^n$. In other
words,

$$V_R = R^n V + v(R), \qquad P_R = R^n P + p(R)$$

where $v(R)$ and $p(R)$ grow slower than $R^n$. Now subtracting the second
equation from the first one, and noting that $V_R = P_R$ for the n-dimensional
rectangle, we get,

$$V - P = (p(R) - v(R)) / R^n.$$

Given that both $v(R)$ and $p(R)$ grow slower than $R^n$, this can only be true
when $V - P = 0$. □
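
Theorem 41 can be checked numerically for small two-dimensional tiles. The following sketch is illustrative only: it handles 2 x 2 integer matrices, the helper names are assumptions, and the coefficients of each candidate point are obtained with Cramer's rule in exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def det2(r1, r2):
    """Determinant of the 2x2 matrix with rows r1 and r2."""
    return r1[0] * r2[1] - r1[1] * r2[0]


def tile_points(L):
    """Integer points x = a1*L[0] + a2*L[1] with 0 <= a1, a2 < 1."""
    d = det2(L[0], L[1])
    if d == 0:
        raise ValueError("the rows of L must be independent")
    # The tile lies inside the bounding box of its four vertices.
    xs = [0, L[0][0], L[1][0], L[0][0] + L[1][0]]
    ys = [0, L[0][1], L[1][1], L[0][1] + L[1][1]]
    points = []
    for x in product(range(min(xs), max(xs) + 1),
                     range(min(ys), max(ys) + 1)):
        # Cramer's rule gives the coordinates of x in the basis of rows of L.
        a1 = Fraction(det2(x, L[1]), d)
        a2 = Fraction(det2(L[0], x), d)
        if 0 <= a1 < 1 and 0 <= a2 < 1:
            points.append(x)
    return points


L = [[3, 1], [1, 2]]  # a hypothetical parallelogram tile
print(len(tile_points(L)), abs(det2(L[0], L[1])))
```

For this tile both numbers are 5: the points (0,0), (1,1), (2,1), (2,2) and (3,2) are the only integer points in the semi-open parallelogram.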

Proposition 41. The number of integer points in any general tile is equal 
to the number of integer points in the tile at the origin. 

Proof:. Straight-forward from the definition of a general tile. □ 

In the following discussion, we ignore the effects of the boundaries of the 
iteration space in computing the number of integer points in a tile. As our 
interest is in minimizing the communication for a general tile, we can ignore 
boundary effects. 

4.2 Footprints in the Data Space

For a system with caches and uniform access memory, the problem of loop 
partitioning is to find an optimal matrix L that minimizes the number of 
cache misses. The first step is to derive an expression for the number of cache 
misses for a given tile L. Because the number of cache misses is related to 
the number of unique data elements accessed, we introduce the notion of a 
footprint that defines the data elements accessed by a tile. The footprints are 
regions of the data space accessed by a loop tile. 

Definition 43. The footprint of a tile of a loop partition with respect to a 
reference A[g(i)] is the set of all data elements A[g(i)] of A, for i an element 
of the tile. 

The footprint gives us all the data elements accessed through a particular 
reference from within a tile of a loop partition. Because we consider homoge- 
neous loop tiles, the number of data elements accessed is the same for each 
loop tile. 

We will compute the number of cache misses for the system with caches 
and uniform access memory to illustrate the use of footprints. The body 
of the loop may contain references to several variables and we assume that 
aliasing has been resolved; two references with distinct names do not refer 
to the same location. Let $A_1, A_2, \ldots, A_N$ be references to array A within the
loop body, and let $f(A_i)$ be the footprint of the loop tile at the origin with
respect to the reference $A_i$ and let

$$f(A_1, A_2, \ldots, A_N) = \bigcup_{i=1}^{N} f(A_i)$$

be the cumulative footprint of the tile at the origin. The number of cache
misses with respect to the array A is $|f(A_1, A_2, \ldots, A_N)|$. Thus, computing
the size of the individual footprints and the size of their union is an important 
part of the loop partitioning problem. 

To facilitate computing the size of the union of the footprints we divide the 
references into multiple disjoint sets. If two footprints are disjoint or mostly 
disjoint, then the corresponding references are placed in different sets, and 
the size of the union is simply the sum of the sizes of the two footprints. 

However, references whose footprints overlap substantially are placed in 
the same set. The notion of uniformly intersecting references is introduced 
to specify precisely the idea of “substantial overlap”. Overlap produces
temporal locality in cache accesses, and computing the size of the union of their
footprints is more complicated. 

The notion of uniformly intersecting references is derived from definitions 
of intersecting references and uniformly generated references. 

Definition 44. Two references $A[g_1(i)]$ and $A[g_2(i)]$ are said to be
intersecting if there are two integer vectors $i_1, i_2$ such that $g_1(i_1) = g_2(i_2)$. For
example, $A[i + c_1, j + c_2]$ and $A[j + c_3, i + c_4]$ are intersecting, whereas $A[2i]$
and $A[2i + 1]$ are non-intersecting.

Definition 45. Two references $A[g_1(i)]$ and $A[g_2(i)]$ are said to be uniformly
generated if

$$g_1(i) = iG + a_1 \quad \text{and} \quad g_2(i) = iG + a_2$$

where G is a linear transformation and $a_1$ and $a_2$ are integer constants.

The intersection of footprints of two references that are not uniformly gen- 
erated is often very small. For non-uniformly generated references, although 
the footprints corresponding to some of the iteration-space tiles might overlap 
partially, the footprints of others will have no overlap. Since we are interested 
in the worst-case communication volume between any pair of footprints, we 
will assume that the total communication generated by two non-uniformly 
intersecting references is essentially the sum of the individual footprints. 

However, the condition that two references are uniformly generated is 
not sufficient for two references to be intersecting. As a simple example, 
$A[2i]$ and $A[2i + 1]$ are uniformly generated, but the footprints of the two
references do not intersect. For the purpose of locality optimization through 
loop partitioning, our definition of reuse of array references will combine the 
concept of uniformly generated arrays and the notion of intersecting array 
references. This notion is similar to the equivalence classes within uniformly 
generated references defined in [3]. 

Definition 46. Two array references are uniformly intersecting if they are 
both intersecting and uniformly generated. 

Example 42. The following sets of references are uniformly intersecting.

1. $A[i, j]$, $A[i + 1, j - 2]$, $A[i, j + 4]$.

2. $A[2j, 2, i]$, $A[2j - 5, 2, i]$, $A[2j + 3, 2, i]$.

The following pairs are not uniformly intersecting.

1. $A[i, j]$, $A[2i, j]$.

2. $A[i, j]$, $A[2i, 2j]$.

3. $A[j, 2, i]$, $A[j, 3, i]$.

4. $A[2i]$, $A[2i + 1]$.

5. $A[i + 2, 2i + 4]$, $A[i + 5, 2i + 8]$.

6. j]. 
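
Definitions 44 to 46 can be exercised with a small sketch. It is a simplification under stated assumptions: a reference is represented as a hypothetical pair `(G, a)`, and the intersection test is a bounded brute-force search, so a positive answer is conclusive while a negative answer only says that no common element exists for index values within the chosen bound. An exact test would solve the underlying linear Diophantine system instead.

```python
from itertools import product


def uniformly_generated(ref1, ref2):
    """References are (G, a) pairs; uniformly generated means equal G."""
    return ref1[0] == ref2[0]


def intersecting_within(ref1, ref2, bound=10):
    """Search for i1, i2 in [-bound, bound]^l with g1(i1) == g2(i2)."""
    def g(ref, i):
        G, a = ref
        return tuple(sum(i[r] * G[r][c] for r in range(len(i))) + a[c]
                     for c in range(len(a)))

    depth = len(ref1[0])  # loop nesting, assumed equal for both references
    box = list(product(range(-bound, bound + 1), repeat=depth))
    images = {g(ref1, i) for i in box}
    return any(g(ref2, i) in images for i in box)


def uniformly_intersecting(ref1, ref2, bound=10):
    return uniformly_generated(ref1, ref2) and intersecting_within(ref1, ref2, bound)


identity = [[1, 0], [0, 1]]
# A[i,j] and A[i+1,j-2]
print(uniformly_intersecting((identity, [0, 0]), (identity, [1, -2])))
# A[2i] and A[2i+1]: uniformly generated but never equal
print(uniformly_intersecting(([[2]], [0]), ([[2]], [1])))
# A[i,j] and A[2i,j]: intersecting but not uniformly generated
print(uniformly_intersecting((identity, [0, 0]), ([[2, 0], [0, 1]], [0, 0])))
```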

Footprints in the data space for a set of uniformly intersecting references
are translations of one another, as shown below. The footprint with respect
to the reference $(G, a_s)$ is a translation of the footprint with respect to the
reference $(G, a_r)$, where the translation vector is $a_s - a_r$.

Proposition 42. Given a loop tile at the origin L and references $r = (G, a_r)$
and $s = (G, a_s)$ belonging to a uniformly generated set defined by G, let $f(r)$
denote the footprint of L with respect to r, and let $f(s)$ denote the footprint
of L with respect to s. Then $f(s)$ is simply a translation of $f(r)$, where each
point of $f(s)$ is a translation of a corresponding point of $f(r)$ by an amount
given by the vector $(a_s - a_r)$. In other words,

$$f(s) = f(r) + (a_s - a_r).$$

This follows directly from the definition of uniformly generated references.
Recall that an element i of the loop tile is mapped by the reference $(G, a_r)$
to data element $d_r = iG + a_r$, and by the reference $(G, a_s)$ to data element
$d_s = iG + a_s$. The translation vector, $(d_s - d_r) = (a_s - a_r)$, is clearly independent of i.
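
The translation property can be confirmed by enumeration. The sketch below uses the two references to B from Example 43 and a hypothetical 4 x 3 rectangular tile; the function name `footprint` is an assumption made for illustration.

```python
def footprint(tile, G, a):
    """Set of data elements i*G + a for every iteration i in the tile."""
    return {
        tuple(sum(i[r] * G[r][c] for r in range(len(i))) + a[c]
              for c in range(len(a)))
        for i in tile
    }


G = [[1, 0], [1, 1]]             # B[i+j, j] and B[i+j+1, j+2] share this G
a_r, a_s = (0, 0), (1, 2)        # their offset vectors
tile = [(i, j) for i in range(4) for j in range(3)]  # hypothetical 4x3 tile

f_r = footprint(tile, G, a_r)
f_s = footprint(tile, G, a_s)

# Translate every point of f(r) by (a_s - a_r) and compare with f(s).
shift = tuple(s - r for s, r in zip(a_s, a_r))
translated = {tuple(p + t for p, t in zip(point, shift)) for point in f_r}
print(f_s == translated)
```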

The volume of cache traffic imposed on the network is related to the 
size of the cumulative footprint. We describe how to compute the size of the 
cumulative footprint in the following two sections as outlined below. 

— First, we discuss how the size of the footprint for a single reference within a 
loop tile can be computed. In general, the size of the footprint with respect 
to a given reference is not the same as the number of points in the iteration 
space tile. 

— Second, we describe how the size of the cumulative footprint for a set of 
uniformly intersecting references can be computed. The sizes of the cumu- 
lative footprints for each of these sets are then summed to produce the size 
of the cumulative footprint for the loop tile. 

4.3 Size of a Footprint for a Single Reference 

This section shows how to compute the size of the footprint (with respect to 
a given reference and a given loop tile L) efficiently for certain common cases 
of G. The general case of G is dealt with in Sect. 5. We begin with a simple 
example to illustrate our approach. 



Example 43. 

Doall (i=0:99, j=0:99) 

    A[i,j] = B[i+j,j] + B[i+j+1,j+2]
EndDoall 

The reference matrix G is

$$G = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.$$

Let us suppose that the loop tile at the origin L is given by

$$L = \begin{pmatrix} L_1 & L_1 \\ L_2 & 0 \end{pmatrix}.$$



Fig. 4.2. Tile L at the origin of the iteration space. 



In Fig. 4.2 we show this tile at the origin of the iteration space and the
footprint of the tile (at the origin) with respect to the reference $B[i + j, j]$ is
shown in Fig. 4.3. The matrix

$$f(B[i + j, j]) = LG = \begin{pmatrix} 2L_1 & L_1 \\ L_2 & 0 \end{pmatrix}$$

describes the footprint. As shown later, the integer points in the semi open
parallelogram specified by LG are the footprint of the tile and so the size of
the footprint is $|\det(LG)|$. We will use D to denote the product LG as it
appears often in our discussion.

The rest of this subsection focuses on deriving the set of conditions under
which the footprint size is given by $|\det(D)|$. Briefly, we show that G being
unimodular is a sufficient (but not necessary) condition. The next section
derives the size of the cumulative footprint for multiple uniformly intersecting
references.

In general, is the footprint exactly the integer points in D = LG? If not,
how do we compute the footprint? The first question can be expanded into
the following two questions.

— Is there a point in the footprint that lies outside the hyperparallelepiped
D? It follows easily from linear algebra that this is not the case.

— Is every integer point in D an element of the footprint? It is easy to show
this is not true and a simple example corresponds to the reference $A[2i]$.

Fig. 4.3. Footprint of L wrt $B[i + j, j]$ in the data space.

We first study the simple case when the hyperparallelepiped D completely
defines the footprint. A precise definition of the set $S(D)$ of points defined
by the matrix D is as follows.

Definition 47. Given a matrix D whose rows are the vectors $d_i$, $1 \le i \le m$,
$S(D)$ is defined as the set

$$\{ x \mid x = \alpha_1 d_1 + \alpha_2 d_2 + \ldots + \alpha_m d_m, \ 0 \le \alpha_i < 1 \}.$$

$S(D)$ defines all the points in the semi open hyperparallelepiped defined by D.

So for the case where D completely defines the footprint, the footprint
is exactly the integer points in $S(D)$. One of the cases where D completely
defines the footprint is when G is unimodular, as shown below.

Lemma 41. The mapping $R^l \to R^d$ as defined by G is one to one if and
only if the rows of G are independent. Further, the mapping of the iteration
space to the data space ($Z^l \to Z^d$) as defined by G is one to one if and only
if the rows of G are independent.

Proof:. $i_1 G = i_2 G$ implies $i_1 = i_2$ if and only if the only solution to $xG = 0$
is 0. The latter implies that the nullspace of $G^T$ is of dimension 0. From a
fundamental theorem of Linear Algebra [19], this means that the rows of G
are linearly independent. It is to be noted that when the rows of G are not
independent there exists a nontrivial integer solution to $xG = 0$, given that
the entries in G are integers. This proves the second statement of the lemma.
□

Lemma 42. The mapping of the iteration space to the data space as defined 
by G is onto if and only if the columns of G are independent and the g.c.d. 
of the subdeterminants of order equal to the number of columns is 1. 

Proof:. Follows from the Hermite normal form theorem as shown in [20]. □ 
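
The condition of Lemma 42 is straightforward to evaluate for small matrices. The sketch below is an illustration rather than an efficient implementation: it computes every subdeterminant of order d by cofactor expansion and takes their g.c.d. The function names `det` and `maps_onto` are assumptions.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def det(M):
    """Determinant of a square integer matrix by cofactor expansion."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for c, entry in enumerate(M[0]):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += (-1) ** c * entry * det(minor)
    return total


def maps_onto(G):
    """Lemma 42 test for an l x d integer matrix G (row-vector convention)."""
    l, d = len(G), len(G[0])
    if d > l:
        return False  # d columns of length l cannot be independent
    minors = [det([G[r] for r in rows]) for rows in combinations(range(l), d)]
    # A g.c.d. of 0 means every minor vanishes, i.e. dependent columns.
    return reduce(gcd, minors, 0) == 1


print(maps_onto([[4], [5]]))          # A[4i + 5j]
print(maps_onto([[2]]))               # A[2i]
print(maps_onto([[1, 0], [1, 1]]))    # B[i+j, j]
print(maps_onto([[1, 1], [1, -1]]))   # B[i+j, i-j-1]
```

The last case is not onto because i+j and i-j always have the same parity, which shows up as a g.c.d. of 2.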

Lemma 43. If G is invertible then $d \in LG$ if and only if $dG^{-1} \in L$.

Proof:. Clearly G is invertible implies,

$$d \in LG \Rightarrow dG^{-1} \in LGG^{-1} = L.$$

Also,

$$dG^{-1} \in L \Rightarrow dG^{-1}G \in LG \Rightarrow d \in LG.$$

G is invertible implies that the rows of G are independent and hence the
mapping defined by G is one to one from Lemma 41. □

Theorem 42. The footprint of the tile defined by L with respect to the refer-
ence G is identical to the integer points in the semi open hyperparallelepiped
D = LG if G is unimodular.

Proof:. It is immediate from Lemma 42 that G is onto when it is unimodular. 
G is onto implies that every data point in D has an inverse in the iteration 
space. Can the inverse of the data point be outside of L? Lemma 43 shows 
this is not possible since G is invertible. □ 
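
Theorem 42 can be checked on the tile of Example 43. In the sketch below the values L1 = 3 and L2 = 4 are assumed purely for illustration, and the helpers are restricted to two dimensions. The non-unimodular matrix of Example 32 is included as a contrast, where the footprint is a proper subset of the integer points of S(LG).

```python
from fractions import Fraction
from itertools import product


def det2(r1, r2):
    return r1[0] * r2[1] - r1[1] * r2[0]


def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def lattice_points(M):
    """Integer points of the semi-open parallelogram S(M), M a 2x2 matrix."""
    d = det2(M[0], M[1])
    xs = [0, M[0][0], M[1][0], M[0][0] + M[1][0]]
    ys = [0, M[0][1], M[1][1], M[0][1] + M[1][1]]
    return {
        x for x in product(range(min(xs), max(xs) + 1),
                           range(min(ys), max(ys) + 1))
        if 0 <= Fraction(det2(x, M[1]), d) < 1
        and 0 <= Fraction(det2(M[0], x), d) < 1
    }


G = [[1, 0], [1, 1]]   # unimodular: det G = 1
L = [[3, 3], [4, 0]]   # tile of Example 43 with assumed L1 = 3, L2 = 4
D = matmul(L, G)
footprint = {tuple(matmul([list(i)], G)[0]) for i in lattice_points(L)}
print(footprint == lattice_points(D), len(footprint), abs(det2(D[0], D[1])))

G2 = [[1, 1], [1, -1]]  # det = -2, not unimodular (the G of Example 32)
D2 = matmul(L, G2)
footprint2 = {tuple(matmul([list(i)], G2)[0]) for i in lattice_points(L)}
print(len(footprint2), len(lattice_points(D2)))
```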

We make the following two observations about Theorem 42. 

— G being unimodular is a sufficient condition, but not a necessary one. An example
corresponds to the reference $A[i + j]$. Further discussion of this is contained
in Sect. 5.

— One may wonder why G being onto is not sufficient for D to coincide
with the footprint. Even when every integer point in D has an inverse,
it is possible that the inverse is outside of L. For example, consider the
mapping defined by the G matrix

$$G = \begin{pmatrix} 4 \\ 5 \end{pmatrix}$$

corresponding to the reference $A[4i + 5j]$. It is onto as shown by Lemma 42.
However, we will show that not all points in LG are in the footprint.
Consider,

$$L = \begin{pmatrix} 100 & 0 \\ 0 & 100 \end{pmatrix}.$$

LG defines the interval [0, 900) and so it includes the data point (1). But
it can be shown that none of the inverses of the data point (1) belong 
to L; (-1, 1) is an inverse of (1). The same is true for the data points 
(2), (3), (6), (7), and (11). The one to one property of G guarantees that 
no point from outside of L can be mapped to inside of D. The reason for 
this is that the one to one property is true even when G is treated as a 
function on reals. 
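
The gap between the footprint of A[4i + 5j] and the interval [0, 900) can be seen by direct enumeration. This is only a brute-force illustration of the example above; the variable names are assumptions.

```python
# Tile L = diag(100, 100): iterations (i, j) with 0 <= i, j < 100.
footprint = {4 * i + 5 * j for i in range(100) for j in range(100)}

# Integer points of LG, i.e. the semi-open interval [0, 900).
interval = set(range(0, 900))
missing = sorted(interval - footprint)

print(len(interval), len(footprint))
print(missing[:6])   # the smallest data points with no inverse inside L
```

The smallest missing points are exactly the data points (1), (2), (3), (6), (7) and (11) listed in the text. The remaining gaps lie at the upper end of the interval, so the footprint is strictly smaller than the number of integer points in LG.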

Let us now introduce our technique for computing the cumulative foot- 
print when G is unimodular. Algorithms for computing the size of the indi- 
vidual footprints and the cumulative footprint when G is not unimodular are 
discussed in Sect. 5. 

4.4 Size of the Cumulative Footprint 

The size of the cumulative footprint F for a loop tile is computed by sum- 
ming the sizes of the cumulative footprints for each of the sets of uniformly 
intersecting references. This section presents a method for computing the 
size of the cumulative footprint for a set of uniformly intersecting references 
when G is unimodular, that is, when the conditions stated in Theorem 42 
are true. More general cases of G are discussed in Sect. 5. We first describe 
the method when there are exactly two uniformly intersecting references, and 
then develop the method for multiple references. 

Cumulative Footprint for Two References. Let us start by illustrating the
computation of the cumulative footprint for Example 43. The two references
to array B form a uniformly intersecting set and are defined by the following
G matrix.

$$G = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$$

Let us suppose that the loop partition L is given by

$$L = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix}.$$

Then D is given by

$$D = \begin{pmatrix} L_{11} + L_{12} & L_{12} \\ L_{21} + L_{22} & L_{22} \end{pmatrix}.$$

The parallelogram defined by D in the data space is the parallelogram ABCD
shown in Fig. 4.4. ABCD and EFGH shown in Fig. 4.4 are the footprints
of the tile L with respect to the two references ($B[i + j, j]$ and
$B[i + j + 1, j + 2]$ respectively) to array B. In the figure, $AB = (L_{11} + L_{12}, L_{12})$,
$AD = (L_{21} + L_{22}, L_{22})$, and $AE = (1, 2)$.

The size of the cumulative footprint is the size of footprint ABCD plus
the number of data elements in EPDS plus the number of data elements in
SRGH. Given that G is unimodular, the number of data elements is equal to
the area ABCD + SRGH + EPDS = ABCD + ADST + CDUV - SDUH.

Fig. 4.4. Data footprint wrt B[i+j, j] and B[i+j+1, j+2]

Ignoring the area SDUH, we can approximate the total area by

$$\left| \det \begin{bmatrix} L_{11} + L_{12} & L_{12} \\ L_{21} + L_{22} & L_{22} \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} 1 & 2 \\ L_{21} + L_{22} & L_{22} \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} L_{11} + L_{12} & L_{12} \\ 1 & 2 \end{bmatrix} \right|$$



The first term in the above equation represents the area of the footprint of a
single reference, i.e., |det(D)|. It is well known that the area of a
parallelogram is given by the absolute value of the determinant of the matrix
specifying the parallelogram. The second and third terms are the determinants
of the D matrix in which one row is replaced by the offset vector a = (1, 2).
Fig. 4.5 is a pictorial representation of the approximation. The first term is
the parallelogram ABCD and the second and third terms are the shaded regions.

Ignoring SDUH is reasonable if we assume that the offset vectors in a 
uniformly intersecting set of references are small compared to the tile size. We 
refer to this simplification as the overlapping sub-tile approximation. This ap- 
proximation will result in our estimates being higher than the actual values. 
Although one can easily derive a more exact expression, we use the overlap- 
ping sub-tile approximation to simplify the computation. Fig. 8.1 in Sect. 8 
further demonstrates that the error introduced is insignificant, especially for 
parallelograms that are near optimal. 
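
The size of the overestimate can be checked by brute force on a small
instance. The following Python sketch is not part of the original text: it
assumes a rectangular tile L = diag(4, 5) placed at the origin (an arbitrary
choice), enumerates the data elements touched by B[i+j, j] and
B[i+j+1, j+2], and compares the exact size of their union with the
three-determinant estimate. The helper names are hypothetical.

```python
def footprint(sides, G, offset):
    """Data points g(i) = iG + a touched by a rectangular tile at the origin."""
    return {
        (i * G[0][0] + j * G[1][0] + offset[0],
         i * G[0][1] + j * G[1][1] + offset[1])
        for i in range(sides[0])
        for j in range(sides[1])
    }


def det2(row1, row2):
    """Determinant of the 2x2 matrix with the given rows."""
    return row1[0] * row2[1] - row1[1] * row2[0]


G = [[1, 0], [1, 1]]   # reference matrix of B[i+j, j]
sides = (4, 5)         # assumed tile: L = diag(4, 5)
a = (1, 2)             # offset of the second reference relative to the first

# Exact size of the cumulative footprint: union of the two footprints.
exact = len(footprint(sides, G, (0, 0)) | footprint(sides, G, a))

# Rows of D = LG for the diagonal L above, then the three determinants.
D = [[sides[k] * G[k][c] for c in range(2)] for k in range(2)]
estimate = abs(det2(D[0], D[1])) + abs(det2(a, D[1])) + abs(det2(D[0], a))

print(exact, estimate)  # 31 33
```

For this tile the estimate exceeds the exact count by two elements, which
is small relative to the footprint, as the approximation assumes.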

The following expression captures the size of the cumulative footprint for
the above two references in which one of the offset vectors is (0, 0):

$$|\det D| + \sum_{k=1}^{d} |\det D_{k \to a}|$$

where $D_{k \to a}$ is the matrix obtained by replacing the kth row of D by a.

[Figure 4.5: parallelogram ABCD with shaded regions marking the
approximation of EPDS + SRGH]

Fig. 4.5. Difference between the cumulative footprint and the footprint.



If both the offset vectors are nonzero, because only the relative position
of the two footprints determines the area of their non-overlapping region, we
use $a = a_1 - a_0$ in the above equation. The following discussion formalizes
this notion and extends it to multiple references.

Cumulative Footprint for Multiple References. The basic approach for es- 
timating the cumulative footprint size involves deriving an effective offset 
vector a that captures the combined effects of multiple offset vectors when 
there are several overlapping footprints resulting from a set of uniformly in- 
tersecting references. First, we need a few definitions. 

Definition 48. Given a loop tile L, there are two neighboring loop tiles along
the ith row of L, defined by $\{y \mid y = x + l_i,\ x \in \text{tile } L\}$ and
$\{y \mid y = x - l_i,\ x \in \text{tile } L\}$, where $l_i$ is the ith row of L,
for $1 \le i \le l$. We refer to the former neighbor as the positive neighbor
and the latter as the negative neighbor. We also refer to these neighbors as
the neighbors of the parallel sides of the tile determined by the rows of L,
excluding the ith row. Fig. 4.6 illustrates the notion of neighboring tiles.

The notion of neighboring tiles can be extended to the data space in like 
manner as follows. 

Definition 49. Given a loop tile L and a reference $(G, a_r)$, the neighbors of
the data footprint of L along the kth row of D = LG are
$\{y \mid y = x + d_k,\ x \in D + a_r\}$ and
$\{y \mid y = x - d_k,\ x \in D + a_r\}$, where $d_k$ is the kth row of D,
for $1 \le k \le d$.

Definition 410. Given a tile L, L' is a sub-tile with respect to the ith row
of L if the rows of L' are the same as the rows of L except for the ith row,
which is $\alpha$ times the ith row of L, $0 \le \alpha \le 1$.

The approximation of the cumulative footprint in Fig. 4.5 can be ex- 
pressed in terms of sub-tiles of the tile in the data space. ABCD is a tile in 
the data space and the two shaded regions in Fig. 4.5 are sub-tiles of neigh- 
boring tiles containing portions of the cumulative footprint. One can view the 
cumulative footprint as any one of the footprints together with communica- 
tion from the neighboring footprints. The approximation of the cumulative 
footprint expresses the communication from the neighboring tiles in terms of 
sub-tiles to make the computation simpler. 

Definition 411. Let L be a loop tile at the origin, and let $g_r(i) = iG + a_r$,
$1 \le r \le R$, be a set of uniformly intersecting references. For the
footprint of L with respect to reference $(G, a_r)$, communication along the
positive direction of the kth row of D is defined as the smallest sub-tile of
the positive neighbor along the kth row of the footprint which contains the
elements of the cumulative footprint within that neighbor. Communication along
the negative direction is defined similarly. Communication along the kth row
is the sum of these two communications. Each row of D defines a pair of
parallel sides (hyperplanes) of the data footprint determined by the remaining
rows of D. We sometimes refer to the communication along the kth row as the
communication across the parallel sides of D defined by the kth row.

The notion of the communication along the rows of D facilitates comput- 
ing the size of the cumulative footprint. Consider the data footprints of a loop 
tile with respect to a set of uniformly intersecting references shown in Fig. 4.7. 
Here $d_1$, $d_2$ correspond to the rows of the matrix D = LG. The vectors
$a_1, \ldots, a_5$ are the offset vectors corresponding to the set of uniformly intersecting
references. The cumulative footprint can be expressed as the union of any 
one of the footprints and the remaining elements of the cumulative footprint. 
We take the union because a given data element needs to be fetched only 
once into a cache. 

In Fig. 4.7, the cumulative footprint is the union of the footprint of the
loop tile with respect to $a_4$ and the shaded regions corresponding to the
remaining elements of the cumulative footprint resulting from the other
references. The area of the shaded region can be approximated by the sum of
communication along the kth row for $1 \le k \le 2$ as shown in Fig. 4.8. The
area of the communication along $d_2$ is equal to the area of the parallelogram
whose sides are $d_1$ and $a_5 - a_4$. Among the offset vectors, vector $a_5$
has the maximum component along $d_2$ and vector $a_4$ has the minimum (taking
the sign into account) component along $d_2$. Similarly the area of the
communication along $d_1$ is equal to the area of the parallelogram whose sides
are $d_2$ and $a_4 - a_1$ plus the area of the parallelogram whose sides are
$d_2$ and $a_5 - a_4$. This is equal to the area of the parallelogram whose
sides are $d_2$ and $a_5 - a_1$. As before, among the offset vectors, vector
$a_5$ has the maximum component along $d_1$ and vector $a_1$ has the minimum
(taking the sign into account) component along $d_1$. This observation is used
in the proof of Theorem 43. It turns out that the effect of offset vector
$a_5 - a_1$ along $d_2$ and $a_5 - a_4$ along $d_1$ can be captured by a single
vector $\hat{a}$ as shown later.

[Figure 4.6 legend: negative and positive neighbours along each of the two
rows of L]

Fig. 4.6. Neighboring Tiles



Proposition 43. Let L be a loop tile at the origin, and let $g_r(i) = iG + a_r$,
$1 \le r \le R$, be a set of uniformly intersecting references. The volume of
communication along the kth row of D, $1 \le k \le d$, is the same for each of
the footprints (corresponding to the different offset vectors).

Communication along the positive and negative directions will be different
for different footprints. But the total communication along the kth row,
$1 \le k \le d$, is the same for each of the data footprints.

We now derive an expression for the cumulative footprint based on our
notion of communication across the sides of the data footprint. Our goal is
to capture in a single offset vector $\hat{a}$ the communication in a
cache-coherent system resulting from all the offset vectors. More specifically,
we would like the kth component of $\hat{a}$ to reflect the communication per
unit area across the parallel sides defined by the kth row of D. The effective
vector $\hat{a}$ is derived from the spread of a set of offset vectors.

Definition 412. Given a set of d-dimensional offset vectors $a_r$,
$1 \le r \le R$, spread($a_1, \ldots, a_R$) is a vector of the same dimension
as the offset vectors, whose kth component is given by

$$\max_r(a_{r,k}) - \min_r(a_{r,k}), \quad \forall k \in 1, \ldots, d.$$

In other words, the spread of a set of vectors is a vector in which each
component is the difference between the maximum and minimum of the
corresponding components in each of the vectors.

For caches, we use the max - min formulation (or the spread) to calculate
the amount of communication traffic because the data space points
corresponding to the footprints whose offset vectors have values between the
max and the min lie within the cumulative footprint calculated using the
spread.^

The spread as defined above does not quite capture the properties that
we are looking for in a single offset vector except when D is rectangular. If
D is not rectangular, the kth component of spread(a) does not reflect the
communication per unit area across the parallel sides defined by the kth row
of D. To derive the footprint component (or sub-tile) along a row of D,
we need to compute the difference between the maximum and the minimum
components of the offset vectors using D as a basis. Therefore, we extend the
notion of spread to a general basis as follows. Recall that D is a basis for the
data space when G is unimodular.

In the definition below, $b_r$ is the representation of offset vector $a_r$
using D as the basis.

Definition 413. Given a set of offset vectors $a_r$, $1 \le r \le R$, let
$b_r = a_r D^{-1}$, $\forall r \in 1, \ldots, R$, and let $\hat{b}$ be
spread($b_1, \ldots, b_R$). Then

$$\hat{a} = \mathrm{spread}_D(a_1, \ldots, a_R) = \hat{b} D.$$



Looking at the special case where D is rectangular helps in understanding 
the definition. 

Proposition 44. If D is rectangular then

$$\hat{a} = \mathrm{spread}_D(a_1, \ldots, a_R) = \mathrm{spread}(a_1, \ldots, a_R).$$

In other words,

$$\hat{a}_k = \max_r(a_{r,k}) - \min_r(a_{r,k}), \quad \forall k \in 1, \ldots, d.$$



^ For data partitioning, however, the formulation must be modified as discussed
in Sect. 6.

For example, spread((1, 0), (2, -1)) = (2 - 1, 0 - (-1)) = (1, 1).

For $D = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$, the spread is given by

$$\mathrm{spread}_D((1, 0), (2, -1)) = \mathrm{spread}((1, 0)D^{-1}, (2, -1)D^{-1})\,D = (1, 3)$$
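
The two definitions can be sketched in a few lines of Python. This is an
illustrative sketch, not the authors' implementation: the function names
are hypothetical, exact rational arithmetic is used to avoid rounding, and
D is assumed to be invertible. The example takes D to have rows (1, 1) and
(0, 1), which reproduces the value (1, 3) quoted above.

```python
from fractions import Fraction


def spread(vectors):
    """Componentwise max minus min over a set of equal-length vectors."""
    return [max(col) - min(col) for col in zip(*vectors)]


def solve_row(a, D):
    """Return b with b D = a, using Gauss-Jordan elimination on D^T b^T = a^T."""
    n = len(D)
    # Augmented matrix [D^T | a^T] with exact rational entries.
    M = [[Fraction(D[j][i]) for j in range(n)] + [Fraction(a[i])] for i in range(n)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)  # fails if D is singular
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def spread_in_basis(vectors, D):
    """spread_D: take the spread of the offsets expressed in basis D, then map back."""
    b_hat = spread([solve_row(a, D) for a in vectors])
    return [sum(b_hat[k] * D[k][j] for k in range(len(D))) for j in range(len(D))]


offsets = [(1, 0), (2, -1)]
print(spread(offsets))                              # [1, 1]
print(spread_in_basis(offsets, [[1, 1], [0, 1]]))   # [Fraction(1, 1), Fraction(3, 1)]
```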

Lemma 44. Given a hyperparallelepiped tile L, and a set of uniformly
intersecting references $g_r(i) = iG + a_r$, where G is unimodular, the
communication along the kth row of D = LG is $|\det D_{k \to \hat{a}}|$, where
$\hat{a} = \mathrm{spread}_D(a_1, \ldots, a_R)$ and $D_{k \to \hat{a}}$ is the
matrix obtained by replacing the kth row of D by $\hat{a}$.

Proof: Straightforward from the definition of spread and the definition of
communication along the kth row. □

Theorem 43. Given a hyperparallelepiped tile L and a unimodular reference
matrix G, the size of the cumulative footprint with respect to a set of
uniformly intersecting references specified by the reference matrix G and a set
of offset vectors $a_1, \ldots, a_R$ is approximately

$$|\det D| + \sum_{k=1}^{d} |\det D_{k \to \hat{a}}|$$

where $\hat{a} = \mathrm{spread}_D(a_1, \ldots, a_R)$ and $D_{k \to \hat{a}}$ is
the matrix obtained by replacing the kth row of D by $\hat{a}$.

Proof: As observed earlier, the size of the cumulative footprint is
approximately the size of any of the footprints plus the communication across
its sides. Clearly the size of any one of the footprints is given by |det D|.
The rest follows from Lemma 44. □

Finally, as stated earlier, the total communication generated by
non-uniformly intersecting sets of references is essentially the sum of the
communication generated by the individual cumulative footprints. Example 45
in Sect. 4.5 discusses an instance of such a computation.

4.5 Minimizing the Size of the Cumulative Footprint 

We now focus on the problem of finding the loop partition that minimizes
the size of the cumulative footprint. The overall algorithm is summarized in
Table 4.1. The minimization of the communication C is done using standard
optimization algorithms including numerical techniques.

Let us illustrate this procedure through the following two examples.

Table 4.1. An algorithm for minimizing cumulative footprint size for a single
set of uniformly intersecting references. For multiple uniformly intersecting
sets, add the communication component due to each set and then determine
L that minimizes the sum.

Given:     G, offset vectors $a_1, \ldots, a_R$
Goal:      Find L to minimize cumulative footprint size
Procedure: Write D = LG
           Find $b_1, \ldots, b_R = a_1 D^{-1}, \ldots, a_R D^{-1}$
           Find $\hat{b}$ = spread($b_1, \ldots, b_R$)
           Then, write $\hat{a} = \hat{b} D$
           Communication $C = |\det D| + \sum_{k=1}^{d} |\det D_{k \to \hat{a}}|$
           Finally, find the parameters of L that minimize C
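
The evaluation steps of the procedure (everything except the final
minimization) can be sketched in Python as follows. This is a minimal
sketch under stated assumptions, not the authors' code: the function names
are hypothetical, D = LG is assumed square and invertible, and exact
rational arithmetic is used. The sample call uses the two references of
Example 43 with an assumed tile L = diag(4, 5).

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def det(M):
    """Determinant by cofactor expansion along the first row (adequate for small d)."""
    if not M:
        return Fraction(1)
    return sum((-1) ** j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))


def inverse(M):
    """Exact inverse via the adjugate; assumes det(M) != 0."""
    n, d = len(M), det(M)

    def minor(i, j):
        return [r[:j] + r[j + 1:] for k, r in enumerate(M) if k != i]

    return [[(-1) ** (i + j) * det(minor(j, i)) / d for j in range(n)] for i in range(n)]


def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(x * m for x, m in zip(v, col)) for col in zip(*M)]


def spread(vectors):
    return [max(c) - min(c) for c in zip(*vectors)]


def cumulative_footprint(L, G, offsets):
    """Follow Table 4.1 and return (C, a_hat) for one uniformly intersecting set."""
    D = [[Fraction(x) for x in row] for row in matmul(L, G)]
    D_inv = inverse(D)
    b_hat = spread([vecmat(a, D_inv) for a in offsets])
    a_hat = vecmat(b_hat, D)
    C = abs(det(D)) + sum(abs(det(D[:k] + [a_hat] + D[k + 1:])) for k in range(len(D)))
    return C, a_hat


C, a_hat = cumulative_footprint([[4, 0], [0, 5]], [[1, 0], [1, 1]], [(0, 0), (1, 2)])
print(C, a_hat)  # 33 [Fraction(3, 1), Fraction(2, 1)]
```

In this sample the effective vector comes out as (3, 2) rather than the
plain difference (1, 2) of the two offsets, because the spread is taken
over magnitudes in the D basis; both vectors yield the same two
determinants (5 and 8), so the estimate is unchanged.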



Example 44.

Doall (i=1:N, j=1:N, k=1:N)
    A[i,j,k] = B[i-1,j,k+1] + B[i,j+1,k] + B[i+1,j-2,k-3]
EndDoall

Here we have two uniformly intersecting sets of references: one for A
and one for B. Let us look at the class corresponding to B since it is more
instructive. Because A has only one reference, whose G is unimodular, its
footprint size is independent of the loop partition, given a fixed total size of
the loop tile, and therefore need not figure in the optimization process. The
G matrix corresponding to the references to B is

$$G = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The $\hat{a}$ vector is (2, 3, 4). Consider a rectangular partition
$L = \Lambda$ given by

$$\Lambda = \begin{bmatrix} L_i & 0 & 0 \\ 0 & L_j & 0 \\ 0 & 0 & L_k \end{bmatrix}$$

In this example, the D matrix is the same as the L matrix. Because D is
rectangular, we can apply Proposition 44 in simplifying the derivation of
$\hat{a}$. The size of the cumulative footprint for B can now be computed
according to Theorem 43 as

$$L_i L_j L_k + 2 L_j L_k + 3 L_i L_k + 4 L_i L_j$$

This expression must be minimized keeping |det L| (or the product
$L_i L_j L_k$) a constant. The product represents the area of the loop tile and
must be kept constant to ensure a balanced load. The constant is simply
the total area of the iteration space divided by P, the number of processors.
For example, if the loop bounds are I, J, and K, then we must minimize
$L_i L_j L_k + 2 L_j L_k + 3 L_i L_k + 4 L_i L_j$, subject to the constraint
$L_i L_j L_k = IJK/P$.

This optimization problem can be solved using standard methods, for
example, using the method of Lagrange multipliers [21]. The size of the
cumulative footprint is minimized when $L_i$, $L_j$, and $L_k$ are chosen in the
proportions 2, 3, and 4, or

$$L_i : L_j : L_k :: 2 : 3 : 4$$

This implies

$$L_i = (IJK/3P)^{1/3}, \quad L_j = (3/2)(IJK/3P)^{1/3}, \quad L_k = 2(IJK/3P)^{1/3}.$$

Abraham and Hudak’s algorithm [7] gives an identical partition for this ex- 
ample. 
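
The stated proportions can be checked numerically. The sketch below is
illustrative only: it assumes a rectangular tile with real-valued sides, an
effective offset vector with strictly positive components, and a made-up
tile volume IJK/P = 192; the function names are hypothetical.

```python
from math import prod


def footprint_rect(sides, a_hat):
    """Theorem 43 estimate for a rectangular tile:
    volume + sum_k a_hat[k] * (volume / sides[k])."""
    volume = prod(sides)
    return volume + sum(a * volume / s for a, s in zip(a_hat, sides))


def proportional_tile(a_hat, volume):
    """Sides proportional to a_hat, scaled so that their product equals volume."""
    scale = (volume / prod(a_hat)) ** (1.0 / len(a_hat))
    return [a * scale for a in a_hat]


a_hat = (2, 3, 4)
volume = 192.0  # assumed value of IJK/P
best = proportional_tile(a_hat, volume)
print(best, footprint_rect(best, a_hat))  # sides close to 4, 6, 8

# Other tiles of the same volume give a larger estimate.
others = [(8.0, 6.0, 4.0), (3.2, 6.0, 10.0), (volume ** (1 / 3),) * 3]
for sides in others:
    print(sides, footprint_rect(sides, a_hat))
```

With these numbers the proportional tile gives an estimate of about 480
elements, against roughly 528, 485 and 492 for the three alternatives.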

We now use an example to show how to minimize the total number of 
cache misses when there are multiple uniformly intersecting sets of references. 
The basic idea here is that the references from each set contribute additively 
to traffic. 



Example 45.

Doall (i=1:N, j=1:N)
    A(i,j) = B(i-2,j) + B(i,j-1) + C(i+j-1,j) + C(i+j+1,j+3)
EndDoall



There are three uniformly intersecting classes of references, one for B, one 
for C, and one for A. Because A has only one reference, its footprint size is 
independent of the loop partition, given a fixed total size of the loop tile, and 
therefore need not figure in the optimization process. 

For simplicity, let us assume that the tile L is rectangular and is given by

$$L = \begin{bmatrix} L_1 & 0 \\ 0 & L_2 \end{bmatrix}$$

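Because G for the references to array B is the identity matrix, the D = LG
matrix corresponding to references to B is the same as L, and the $\hat{a}$
vector is spread((-2, 0), (0, -1)) = (2, 1). Thus, the size of the corresponding
cumulative footprint according to Theorem 43 is

$$\left| \det \begin{bmatrix} L_1 & 0 \\ 0 & L_2 \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} 2 & 1 \\ 0 & L_2 \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} L_1 & 0 \\ 2 & 1 \end{bmatrix} \right|$$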







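Similarly, D for array C is

$$D = \begin{bmatrix} L_1 & 0 \\ L_2 & L_2 \end{bmatrix}$$

The data footprint D is not rectangular even though the loop tile is. Using
Definition 413, $\hat{a} = \mathrm{spread}_D((-1, 0), (1, 3)) = (4, 3)$, and the
size of the cumulative footprint with respect to C is

$$\left| \det \begin{bmatrix} L_1 & 0 \\ L_2 & L_2 \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} 4 & 3 \\ L_2 & L_2 \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} L_1 & 0 \\ 4 & 3 \end{bmatrix} \right|$$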



The problem of minimizing the size of the footprint reduces to finding the
elements of L that minimize the sum of the two expressions above subject
to the constraint that the area of the loop tile |det L| is a constant to
ensure a balanced load. For example, if the loop bounds are I, J, then the
constraint is |det L| = IJ/P, where P is the number of processors.

The total size of the cumulative footprint simplifies to
$2L_1L_2 + 4L_1 + 3L_2$. The optimal values for $L_1$ and $L_2$ can be shown to
satisfy the equation $4L_1 = 3L_2$ using the method of Lagrange multipliers.
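
The Lagrange-multiplier step can be written out as follows. This is a
worked sketch added for illustration, with A = IJ/P denoting the fixed tile
area and the sides treated as real-valued:

```latex
\begin{aligned}
&\text{minimize } f(L_1, L_2) = 2L_1L_2 + 4L_1 + 3L_2
  \quad \text{subject to } L_1 L_2 = A,\quad A = IJ/P \\
&\frac{\partial}{\partial L_1}:\; 2L_2 + 4 = \lambda L_2,
  \qquad
  \frac{\partial}{\partial L_2}:\; 2L_1 + 3 = \lambda L_1 \\
&\Rightarrow\; \lambda - 2 = \frac{4}{L_2} = \frac{3}{L_1}
  \;\Rightarrow\; 4L_1 = 3L_2 \\
&\Rightarrow\; L_1 = \sqrt{3A/4}, \qquad L_2 = \sqrt{4A/3}
\end{aligned}
```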



5. General Case of G 

This section analyzes the size of the footprint and the cumulative footprint 
for a general G, that is, when G is not restricted to be unimodular. The 
computation of the size of the footprint is by case analysis on the G matrix. 

5.1 G Is Invertible, but Not Unimodular 

When G is invertible but not unimodular, not every integer point in
the hyperparallelepiped D is an image of an iteration point in L. A unit cube
in the iteration space is mapped to a hyperparallelepiped of volume equal to
|det G|. So the size of the data footprint is |det D / det G| = |det L|. When
G is invertible the size of the data footprint is exactly the size of the loop
tile since the mapping is one to one.
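
A small enumeration makes the sparsity concrete. The Python sketch below
is illustrative only: it assumes a rectangular 5 x 4 tile at the origin
and the non-unimodular matrix G with rows (1, 1) and (1, -1), whose
determinant is -2; the function name is hypothetical.

```python
def footprint(sides, G):
    """Distinct data points iG reached from a rectangular tile at the origin."""
    return {
        (i * G[0][0] + j * G[1][0], i * G[0][1] + j * G[1][1])
        for i in range(sides[0])
        for j in range(sides[1])
    }


sides = (5, 4)
G = [[1, 1], [1, -1]]

det_G = G[0][0] * G[1][1] - G[0][1] * G[1][0]   # -2
det_L = sides[0] * sides[1]                      # 20
det_D = det_L * det_G                            # det(LG) = det(L) det(G) = -40

points = footprint(sides, G)
print(len(points), abs(det_D), abs(det_D) // abs(det_G))  # 20 40 20

# Only every other lattice point is touched: x + y is always even here.
print(all((x + y) % 2 == 0 for x, y in points))  # True
```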

Next, the expression for the size of the cumulative footprint is very similar 
to the one in Theorem 43, except that the data elements accessed are not 
dense in the data space. That is, the data space is sparse. 

Lemma 51. Given an iteration space I, a reference matrix G, and a
hyperparallelepiped $D_1$ in the data space, if the vertices of $D_1 G^{-1}$ are
in I then the number of elements in the intersection of $D_1$ and the footprint
of I with respect to G is $|\det D_1 / \det G|$.







Proof: Clear if one views $D_1 G^{-1}$ as the loop tile L. □

Theorem 51. Given a hyperparallelepiped tile L, and an invertible reference
matrix G, the size of the cumulative footprint with respect to a set of
uniformly intersecting references specified by the reference matrix G and a set
of offset vectors $a_1, \ldots, a_R$ is approximately

$$\frac{|\det D| + \sum_{k=1}^{d} |\det D_{k \to \hat{a}}|}{|\det G|}$$

where $\hat{a} = \mathrm{spread}_D(a_1, \ldots, a_R)$ and $D_{k \to \hat{a}}$ is
the matrix obtained by replacing the kth row of D by $\hat{a}$.

Proof: Using Lemma 51 one can construct a proof similar to that of
Theorem 43. □

Example 32 (repeated below for convenience) possesses a G that is in- 
vertible, but not unimodular. 

Doall (i=101:200, j=1:100)
    A[i,j] = B[i+j,i-j-1] + B[i+j+4,i-j+3]
EndDoall

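For this example, the reference matrix G corresponding to array B is

$$G = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

and the offset vectors are

$$a_0 = (0, -1) \quad \text{and} \quad a_1 = (4, 3).$$

Let us find the optimal rectangular partition L of the form

$$L = \begin{bmatrix} L_i & 0 \\ 0 & L_j \end{bmatrix}$$

The footprint matrix D is given by

$$D = \begin{bmatrix} L_i & L_i \\ L_j & -L_j \end{bmatrix}$$

The offset vectors using D as a basis are

$$b_0 = a_0 D^{-1} = (-1/(2L_i),\ 1/(2L_j)),$$

$$b_1 = a_1 D^{-1} = (7/(2L_i),\ 1/(2L_j)).$$

The vector $\hat{b} = (4/L_i, 0)$ and the vector

$$\hat{a} = \hat{b} D = (4, 4).$$

The size of the cumulative footprint according to Theorem 51 is

$$\frac{\left| \det \begin{bmatrix} L_i & L_i \\ L_j & -L_j \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} L_i & L_i \\ 4 & 4 \end{bmatrix} \right|
+ \left| \det \begin{bmatrix} 4 & 4 \\ L_j & -L_j \end{bmatrix} \right|}
{\left| \det \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \right|}$$

which is

$$L_i L_j + 4 L_j.$$

If we constrain $L_i L_j = 100$ for load balance, we get $L_j = 1$ and
$L_i = 100$. This partitioning represents horizontal striping of the iteration
space.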
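
The result can be cross-checked by counting data elements directly. The
following Python sketch is illustrative and not from the original text: it
places a rectangular tile at the origin (the count does not depend on the
tile's position), enumerates the elements of B touched by the two
references, and compares the size of their union with the expression
above. The function names are hypothetical.

```python
def footprint(i_range, j_range, ref):
    """Distinct data elements touched by one reference over a rectangular tile."""
    return {ref(i, j) for i in i_range for j in j_range}


def cumulative(li, lj):
    """Exact cumulative footprint size of the two references to B for an li x lj tile."""
    i_range, j_range = range(li), range(lj)
    first = footprint(i_range, j_range, lambda i, j: (i + j, i - j - 1))
    second = footprint(i_range, j_range, lambda i, j: (i + j + 4, i - j + 3))
    return len(first | second)


# Three tile shapes with the same area of 100 iterations.
for li, lj in [(100, 1), (50, 2), (10, 10)]:
    print(li, lj, cumulative(li, lj), li * lj + 4 * lj)
```

For these three shapes the exact counts are 104, 108 and 140, equal to the
estimate in each case, and the 100 x 1 tile is the smallest of the three.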

5.2 Columns of G Are Dependent and the Rows Are Independent 

We can apply Theorem 51 to compute the size of a footprint when the columns
of G are dependent, as long as the rows are independent. We derive a G' from
G by choosing a maximal set of independent columns from G, such that G' is
invertible. We can then apply Theorem 51 to compute the size of the footprint
as shown in the following example.

Example 51. Consider the reference A[i, 2i, i+j] in a doubly nested loop.
The columns of the G matrix

$$G = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

are not independent. We choose G' to be

$$G' = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$

Now D' = LG' completely specifies the footprint. The size of the footprint
equals |det D'| = |det L|. If we choose G' to be

$$G' = \begin{bmatrix} 2 & 1 \\ 0 & 1 \end{bmatrix}$$

then the size of the footprint is |det D'|/2 for the new D', since |det G'| is
now 2. But both expressions evaluate to the same value, |det L|, as one would
expect.

5.3 The Rows of G Are Dependent 

When the rows of G are dependent, the mapping from the iteration
space to the data space is many to one. It is hard to derive an expression
for the footprint in general when the rows are dependent. However, we can
compute the footprint and the cumulative footprint for many special cases
that arise in actual programs. In this section we shall look at the common case
where the rows are dependent because one or more of the index variables do
not appear in the array reference. We shall illustrate our technique with the
matrix multiply program shown in Example 52 below. The notation l$C[i,j]
means that the read-modify-write of C[i,j] is atomic.

Example 52.

Doall (i=0:N, j=0:N, k=0:N)
    l$C[i,j] = l$C[i,j] + A[i,k]*B[k,j]
EndDoall

The references to the matrices A, B and C belong to separate uniformly
intersecting sets. So the cumulative footprint is the sum of the footprints
of each of the references. We will focus on A[i,k]; the footprint computation
for the other references is similar. The G matrix for A[i,k] is

$$G = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}$$
We cannot apply our earlier results to compute the footprint since G is
a many-to-one mapping. However, we can find an invertible G' such that for
every loop tile L, there is a tile L' such that the number of elements in
footprints LG and L'G' are the same. For the current example, G' is obtained
from G by deleting the row of zeros, resulting in a two-dimensional identity
matrix. Similarly L' is obtained from L by eliminating the corresponding
(second) column of L. Now, it is easy to show that the number of elements in
footprints LG and L'G' are the same by establishing a one-to-one
correspondence between the two footprints. Let us use this method to compute
the size of the footprint corresponding to the reference A[i,k]. Let us assume
that L is rectangular to make the computations simpler. Let L be

$$L = \begin{bmatrix} L_i & 0 & 0 \\ 0 & L_j & 0 \\ 0 & 0 & L_k \end{bmatrix}$$

Now L' is

$$L' = \begin{bmatrix} L_i & 0 \\ 0 & 0 \\ 0 & L_k \end{bmatrix}$$

So the size of the footprint is $L_i L_k$. Similarly, one can show that the
sizes of the other two footprints are $L_i L_j$ and $L_j L_k$. The cumulative
footprint is $L_i L_k + L_i L_j + L_j L_k$, which is minimized when $L_i$,
$L_j$ and $L_k$ are equal: for a fixed tile volume $L_i L_j L_k$, the
arithmetic-geometric mean inequality gives
$L_i L_k + L_i L_j + L_j L_k \ge 3 (L_i L_j L_k)^{2/3}$, with equality exactly
when the three products, and hence the three sides, are equal.



6. Other System Environments

This section describes how our framework can be used to solve the parti- 
tioning problem in a wide range of systems. We discuss (1) loop partitioning 
to minimize coherency traffic in systems with coherent caches and uniform- 
access main memory, (2) loop partitioning with non-unit cache line sizes, 
and (3) loop and data partitioning in distributed-memory systems without
caches. Then, Sect. 7 discusses simultaneous loop and data partitioning in
cache-coherent distributed shared-memory systems (NUMA).

6.1 Coherence-Related Cache Misses

Our analysis presented in the previous section was concerned with minimiz- 
ing the cumulative footprint size. This process of minimizing the cumulative 
footprint size not only minimizes the number of first-time cache misses, but 
the number of coherence-related misses as well. For example, consider the
forall loop embedded within a sequential loop in Example 61. Here forall
means that all the reads are done prior to the writes. In other words, the data
read in iteration t corresponds to the data written in iteration t - 1.



Example 61.

Doseq (t=1:T)
    forall (i=1:N, j=1:N)
        A(i,j) = A(i+1,j)
    EndDoall
EndDoseq

For this example, we have

$$G = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
\hat{a} = \mathrm{spread}((0, 0), (1, 0)) = (1, 0).$$

Let us attempt to minimize the cumulative footprint for a loop partition of
the form

$$L = \begin{bmatrix} L_i & 0 \\ 0 & L_j \end{bmatrix}$$

The cumulative footprint size is given by

$$L_i L_j + L_j.$$

In a load-balanced partitioning, $|\det L| = L_i L_j$ is a constant, so the
$L_i L_j$ term drops out of the optimization. The optimization process then
attempts to minimize $L_j$, which is proportional to the volume of cache
coherence traffic, as depicted in Fig. 6.1.

Let us focus on regions X, Y and Z in Fig. 6.1(c). As explained in Fig. 4.8,
the processor working on the loop tile to which these regions belong (say,
processor $P_0$) shares a portion of its cumulative footprint with processors
working on neighboring regions in the data space. Specifically, region Z is a
sub-tile of the positive neighbor and region Y is a sub-tile shared with its
negative neighbor. Region X, however, is completely private to $P_0$.

Let us consider the situation after the first iteration of the outer
sequential loop. Accesses of data elements within region X will hit in the
cache, and thereby incur zero communication cost. Data elements in region Z,
however, potentially cause misses because the processor working on the positive
neighbor might have previously written into those elements, resulting in those
elements being invalidated from $P_0$'s cache. Each of these misses by
processor $P_0$ suffers a network round trip because of the need to inform the
processor working on its positive neighbor to perform a write-back and then to
send the data to processor $P_0$. Furthermore, if the home memory location for
the block is elsewhere, the miss requires an additional network round trip.
Similarly, in region Y, a write by processor $P_0$ potentially incurs two
network round trips as well. The two round trips result from the need to
invalidate the data block from the cache of the processor working on the
negative neighbor, and then to fetch the blocks into $P_0$'s cache.




[Figure 6.1 panels: (a) footprint of A[i,j]; (b) cumulative footprint of
A[i,j], A[i+1,j]; (c) regions X, Y, Z]

Fig. 6.1. (a) Footprint of reference A[i,j] for a rectangular L. (b)
Cumulative footprint for the references A[i,j] and A[i+1,j]. The hashed region
Z represents the increase in footprint size due to the reference A[i+1,j].
(c) The regions X, Y, Z collectively represent the cumulative footprint for
references A[i,j] and A[i+1,j]. Region Z represents the area in the data
space shared with the positive neighbor. Region Y represents the area in the
data space shared with the negative neighbor.

In any case, the coherence traffic is proportional to the area of the shared
region Z, which is equal to the area of the shared region Y, and is given by
$L_j$. So the total communication is minimized by choosing the tile with
$L_j = 1$.

6.2 Effect of Cache Line Size 

The effect of cache line sizes can be incorporated easily into our analysis.
Because large cache lines fetch multiple data words at the cost of a single
miss, one data space dimension will be favored by the cache. Without loss
of generality, let us assume that the jth dimension of the data space benefits
from larger cache lines. Then, the effect of cache lines of size B can be
incorporated into our analysis by replacing each element $d_{ij}$ in the jth
column of D in Theorem 43 by




to reflect the lower cost of fetching multiple words in the jth dimension of
the data space^, and by modifying the definition of intersecting references to
the following.

Definition 61. Two references $A[g_1(i)]$ and $A[g_2(i)]$ are said to be
intersecting if there are two integer vectors $i_1$, $i_2$ for which
$A[g_1(i_1)] = A[(d_{11}, d_{12}, \ldots)]$ and
$A[g_2(i_2)] = A[(d_{21}, d_{22}, \ldots)]$ such that
$A[(\ldots, d_{1(j-1)}, \lfloor d_{1j}/B \rfloor, \ldots)] =
A[(\ldots, d_{2(j-1)}, \lfloor d_{2j}/B \rfloor, \ldots)]$, where B is the size
of a cache line, and the jth dimension in the data space benefits from larger
cache lines.

6.3 Data Partitioning in Distributed-Memory Multicomputers 

In systems in which main memory is distributed with the processing nodes 
(e.g., see the NUMA systems in Fig. 3.3(a) and (b)), data partitioning is the 
problem of partitioning the data arrays into data tiles and the nested loops 
into loop tiles and assigning the loop tiles (indicated by L) to the processing 
nodes and the corresponding data tiles (indicated by D) to memory modules 
associated with the processing nodes so that a maximum number of the data 
references made by the loop tiles are satisfied by the local memory module. 
Our formulation facilitates data partitioning straightforwardly. There are two 
cases to consider: systems without caches and systems with caches. This 
section addresses systems without caches. NUMA systems with caches are 
dealt with in Sect. 7. 

^ We note that the estimate of cumulative footprint size will be slightly
inaccurate if the footprint is misaligned with the cache block.

The compiler has two options to optimize communication volume in systems
without caches. The compiler can choose to make local copies of remote
data, or it can fetch remote data each time the data is needed. In the
former case, the compiler can use the same partitioning algorithms described in
this paper for systems with caches, but it must also solve the data coherence
problem for the copied data. This section addresses the latter case.

The general strategy is as follows: The optimal loop partition L is first de- 
rived by minimizing the cumulative footprint size as described in the previous 
sections. Data partitioning requires the additional derivation of the optimal 
data partition D for each class of uniformly intersecting references from the 
optimal loop partition L. We derive the shapes of the data tiles D for each G 
corresponding to a specific class of uniformly intersecting references. A
specific data tile is chosen from the footprints corresponding to each
reference in a uniformly intersecting set. We then place each loop tile with
the data tiles accessed by it on the same processing node.

Although the overall theory remains largely the same as described earlier,
we must make one change in the footprint size computation to reflect the
fact that a given data tile is placed in local memory and data elements from
neighboring tiles have to be fetched from remote memory modules each time
they are accessed. Because data partitioning for distributed-memory systems
without caches (see Fig. 3.3(b)) assumes that data from other memory modules
is not dynamically copied locally (as in systems with caches), we replace
the max - min formulation by the cumulative spread $\hat{a}^+$ of a set of
uniformly intersecting references. That is,

$$\hat{a}^+ = \text{cumulative spread}_D(a_1, \ldots, a_R) = \hat{b}^+ D,$$

in which the kth element of $\hat{b}^+$ is given by

$$\hat{b}^+_k = \sum_r \left| b_{r,k} - \mathrm{med}_r(b_{r,k}) \right|,
\quad \forall k \in 1, \ldots, d,$$

where $b_r = a_r D^{-1}$, $\forall r \in 1, \ldots, R$, and
$\mathrm{med}_r(b_{r,k})$ is the median of the offsets in the kth dimension.
The rest of our framework for minimizing the footprint size applies to data
partitioning if $\hat{a}$ is replaced by $\hat{a}^+$.
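
The difference between the two formulations can be illustrated in Python.
This is a hedged sketch rather than the authors' method: it takes the kth
element of the cumulative spread to be the sum, over the references, of the
distance between each offset and the median offset in that dimension, and
it assumes D is the identity so that each b_r equals a_r. The five-point
row stencil and the function names are made up for the example.

```python
from statistics import median


def spread(b_vectors):
    """Max minus min in each dimension (the formulation used for caches)."""
    return [max(col) - min(col) for col in zip(*b_vectors)]


def cumulative_spread_coeffs(b_vectors):
    """Per dimension: sum over references of |b_{r,k} - median_r(b_{r,k})|."""
    result = []
    for column in zip(*b_vectors):
        m = median(column)
        result.append(sum(abs(b - m) for b in column))
    return result


# Assumed offsets of five references along one row, already expressed in basis D.
bs = [(-2, 0), (-1, 0), (0, 0), (1, 0), (2, 0)]

print(spread(bs))                    # [4, 0]
print(cumulative_spread_coeffs(bs))  # [6, 0]
```

The cumulative spread is larger than the spread here because, without a
cache, every reference pays separately for the remote elements it touches.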

The data partitioning strategy proceeds as follows. As in loop partition- 
ing for caches, for a given loop tile L, we first write an expression for the 
communication volume by deriving the size of that portion of the cumulative 
footprint not contained in local memory. This communication volume is given 
by 



d 

<=i 

We then derive the optimal L to minimize this communication volume. Next,
we derive the optimal data partition D for each class of uniformly
intersecting references from the optimal loop partition L, as described in
the previous section on systems with caches. A specific data tile is chosen
from the footprints corresponding to each reference in a uniformly
intersecting set. In systems without caches, because a single data element
might have to be fetched multiple times, the choice of a specific data
footprint does matter. A simple heuristic to maximize the number of local
accesses is to choose a data tile whose offsets are the medians of all the
offsets in each dimension. We can show that using a median tile is optimal
for one-dimensional data spaces, and close to optimal for higher dimensions.
However, a detailed description is beyond the scope of this paper. We then
place each loop tile with the corresponding data tiles accessed by it on the
same processor.
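The median-tile heuristic can be sketched as follows. The function names are hypothetical, and the total offset distance used for the comparison is only an assumed proxy for the number of non-local accesses, not the measure used in the paper.

```python
from statistics import median_low

def median_offset(offsets):
    """Per-dimension median of the offset vectors of one uniformly intersecting set."""
    return tuple(median_low(column) for column in zip(*offsets))

def total_distance(offsets, chosen):
    """Sum over all references of the per-dimension distance to the chosen offset."""
    return sum(abs(a - c) for off in offsets for a, c in zip(off, chosen))

# Illustrative offsets of three references in one set.
offsets = [(0, 0), (1, -2), (-1, 1)]
chosen = median_offset(offsets)
```

Here `median_low` is used so that each coordinate of the chosen offset is an offset value that actually occurs in that dimension.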



7. Combined Loop and Data Partitioning in DSMs 

Until this point we have been assuming that in systems with caches the cost 
of a cache miss is independent of the actual memory location being accessed. 
In systems in which main memory is distributed with the processing nodes 
(e.g., see Fig. 3.3(a)), data partitioning attempts to maximize the probability 
that cache misses are satisfied in local memory, rather than suffer a remote 
memory access. Unfortunately, when a program has multiple loops that access 
a given data array, the possibility of the loops imposing conflicting data tiling 
requirements for caches arises. Furthermore, the partitioning that achieves 
optimal cache behavior is not necessarily the partitioning that simultaneously 
achieves optimal local memory behavior. This section describes a heuristic 
method for solving this problem. 

7.1 The Cost Model 

The key to finding a communication-minimal partitioning is a cost model 
that allows a tradeoff to be made between cache miss cost and remote memory 
access cost. This cost model drives an iterative solution and is a function that 
takes, as arguments, a loop partition, data partitions for each array accessed 
in the loop, and architectural parameters that determine the relative cost of 
cache misses and remote memory accesses. It returns an estimation of the 
cost of array references for the loop. 

The cost due to memory references in terms of architectural parameters
is computed by the following equation:

    T_{total\ access} = T_R(n_{remote}) + T_L(n_{local}) + T_C(n_{cache})

where T_R, T_L, T_C are the remote, local and cache memory access times,
respectively, and n_remote, n_local, n_cache are the numbers of references
that result in hits to remote memory, local memory and cache memory. T_C and
T_L are fixed by the architecture, while T_R is determined both by the base
remote latency of the architecture and by possible contention if there are
many remote references. T_R may also vary with the number of processors,
depending on the interconnect topology.

n_cache, n_local and n_remote depend on the loop and data partitions. Given a
loop partition, for each UI-set consider the intersection between the footprint
(LG) of that set and a given data partition D. First-time accesses to data in
that intersection will be in local memory, while first-time references to data
outside will be remote. Repeat accesses will likely hit in the cache. A UI-set
may contain several references, each with a slightly different footprint due to
different offsets in the array index expressions. One is selected and called
the base offset, or b. In the following definitions the symbol ≈ will be used
to compare footprints and data partitions. LG ≈ D means that the matrix
equality holds. This equality does not mean that all references in the UI-set
represented by G will be local in the data partition D, because there may be
small offset vectors for each reference in the UI-set.

We define the functions R_b, F_f and F_b, which are all functions of the loop
partition L, data partition D and reference matrix G with the meanings
given previously. For simplicity, we also use R_b, F_f and F_b to denote the
value returned by the respective functions of the same name.



Definition 71. R_b is a function which maps L, D and G to the number of
remote references that result from a single access defined by G and the base
offset b.

In other words, R_b returns the number of remote accesses that result
from a single program reference in a parallel loop, not including the small
peripheral footprint due to multiple accesses in its UI-set. The periphery is
added using F_f, to be described below.

Note that in most cases the G's define UI-sets: accesses to an array with the
same G but different offsets are usually in the same UI-set, and different G's
always have different UI-sets. The only exceptions are accesses with the same
G but large differences in their offsets relative to tile size, in which case
they are considered to be in different UI-sets.

The computation of R_b is simplified by an approximation. One of the two
following cases applies to loop and data partitions.

1. Loop partition L matches the data partition D, i.e. LG ≈ D. The
   references in the periphery due to small offsets between references in the
   UI-set are considered in F_f. In this case R_b = 0.

2. L does not match D. This is the case where the G matrix used to compute
   D (perhaps from another UI-set) is different from the G for the current
   access, and thus LG and D have different shapes, not just different offsets.
   In this case all references for L are considered remote and R_b = |\det L|.

This is a good approximation because LG and D each represent a regular 
tiling of the data space. If they differ, it means the footprint and data tile 
differ in shape, and do not stride the same way. Thus, even if L’s footprints 
and D partially overlap at the origin, there will be less overlap on other 
processors. For a reasonably large number of processors, some will end up 
with no overlap as shown in the example in Fig. 7.1. Since the execution time 
for a parallel loop nest is limited by the processor with the most remote cache 
misses, the non-overlap approximation is a good one. 
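The two-case approximation can be written down directly. This is a minimal sketch: the tile sizes are illustrative, the row-vector convention (footprint LG) is taken from the text, and the reading R_b = |det L| for the mismatched case is an assumption about the garbled source equation. The helper names are hypothetical.

```python
def det(m):
    """Determinant by Laplace expansion; adequate for the small matrices used here."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def remote_base_refs(L, G, D):
    """R_b: 0 if the footprint LG matches D, otherwise every base reference is remote."""
    return 0 if matmul(L, G) == D else abs(det(L))

L = [[25, 0], [0, 38]]          # illustrative rectangular loop tile
G_same = [[1, 0], [0, 1]]       # reference A[i,j]
G_skew = [[1, 0], [1, 1]]       # reference A[i+j,j]
D = matmul(L, G_same)           # data partition induced by A[i,j]
```

With these values the identity reference is entirely local, while the skewed reference is charged 950 remote references, one per iteration of the tile.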



Doall (i=0:100, j=0:75) 
B[i,j] = A[i,j] + A[i+j,j] 
EndDoall 



(a) Code fragment 



(b) Data space for array A (8 processors). Rectangles: footprints for A[i,j].
Parallelograms: footprints for A[i+j,j]. Processor 7's footprints have no
overlap (numbering is for the footprints of A[i,j]).

Fig. 7.1. Different UI-sets have no overlap 



Definition 72. F_b is the number of first time accesses in the footprint of L
with base offset b. Hence:

    F_b = |\det L|

Definition 73. F_f is the difference between (1) the cumulative footprints of
all the references in a given UI-set for a loop tile, and (2) the base footprint
due to a single reference represented by G and the base offset b. F_f is
referred to as the peripheral footprint.

Theorem 71. The cumulative access time for all accesses in a loop with
partition L, accessing an array having data partition D with reference matrix
G in a UI-set is

    T_R (R_b + F_f) + T_L (F_b - R_b) + T_C (n_{ref} - (F_f + F_b))

where n_ref is the total number of references made by L for the UI-set.

This result can be derived as follows. The number of remote accesses
n_remote is the number of remote accesses with the base offset, which is R_b,
plus the size of the peripheral footprint F_f, giving n_remote = R_b + F_f.
The number of local references n_local is the base footprint, less the remote
portion, i.e. F_b - R_b. Finally, the number of cache hits n_cache is clearly
n_ref - n_remote - n_local, which is equal to n_ref - (F_f + F_b).
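The theorem translates into a few lines of arithmetic. The sketch below follows the derivation above; the latencies and footprint sizes in the example are illustrative assumptions, not measurements from the paper.

```python
def access_time(T_R, T_L, T_C, R_b, F_f, F_b, n_ref):
    """Cumulative access time for one UI-set, following the derivation of the theorem."""
    n_remote = R_b + F_f                  # base remote accesses plus peripheral footprint
    n_local = F_b - R_b                   # base footprint less its remote portion
    n_cache = n_ref - n_remote - n_local  # everything else hits in the cache
    return T_R * n_remote + T_L * n_local + T_C * n_cache

# Assumed latencies (cycles) and an assumed tile: 96 iterations, 3 references each,
# base footprint 96, peripheral footprint 45.
matched = access_time(T_R=40, T_L=10, T_C=1, R_b=0, F_f=45, F_b=96, n_ref=288)
mismatched = access_time(T_R=40, T_L=10, T_C=1, R_b=96, F_f=45, F_b=96, n_ref=288)
```

Under these assumed parameters the mismatched partition costs roughly twice as much as the matched one, which is the tradeoff the cost model is meant to expose.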

Sub-blocking. The above cost model assumes infinite caches. In practice, even 
programs with moderate-sized data sets have footprints much larger than the 
cache size. To overcome this problem the loop tiles are sub-blocked, such that 
each sub-block fits in the cache and has a shape that optimizes for cache 
locality. This optimization lets the cost model remain valid even for finite 
caches. It turned out that sub-blocking was critically important even for 
small to moderate problem sizes. 
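A minimal sketch of sub-blocking for a rectangular loop tile follows. It only shows how a tile is covered by sub-blocks of a given shape; choosing the sub-block shape that optimizes cache locality is done by the footprint analysis and is not shown. The function name and the sizes are hypothetical.

```python
def sub_blocks(tile_rows, tile_cols, block_rows, block_cols):
    """Yield half-open (row_start, row_end, col_start, col_end) ranges covering the tile."""
    for r in range(0, tile_rows, block_rows):
        for c in range(0, tile_cols, block_cols):
            yield (r, min(r + block_rows, tile_rows),
                   c, min(c + block_cols, tile_cols))

# Illustrative values: a 64 x 64 tile cut into 16 x 32 sub-blocks of 512 iterations each.
blocks = list(sub_blocks(64, 64, 16, 32))
areas = [(r1 - r0) * (c1 - c0) for (r0, r1, c0, c1) in blocks]
```

Each sub-block is executed to completion before the next one starts, so only one sub-block's footprint needs to fit in the cache at a time.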

Finite caches and sub-blocking also allow us to ignore the effect of data 
that is shared between loop nests when that data is left behind in the cache 
by one loop nest and reused by another. Data sharing can happen in infinite 
caches due to accesses to the same array when the two loops use the same G. 
However, when caches are much smaller than data footprints, and the com- 
piler resorts to sub-blocking, the possibility of reuse across loops is virtually 
eliminated. 

This model also assumes a linear flow of control through the loop nests
of the program. While this is the common case, conditional control flow could
also be handled by our algorithm. Although we do not handle this case now, an
approach would be to assign probabilities to each loop nest, perhaps based
on profile data, and to multiply the probabilities by the loop size to obtain
an effective loop size for use by the algorithm.
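The suggested extension amounts to a simple weighting. The sketch below is purely illustrative: the loop names, probabilities and sizes are assumptions, and the paper does not implement this case.

```python
def effective_sizes(loops):
    """Map each loop name to probability * size, given {name: (probability, size)}."""
    return {name: prob * size for name, (prob, size) in loops.items()}

# Hypothetical profile data for two loop nests on opposite branches of a conditional.
profile = {"then_loop": (0.75, 10000), "else_loop": (0.25, 10000)}
sizes = effective_sizes(profile)
```

The effective sizes would then replace the raw loop sizes wherever the algorithm ranks loops by size.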

7.2 The Multiple Loops Heuristic Method 

This section describes the iterative method, whose goal is to discover a parti- 
tioning of loops and data arrays to minimize communication cost. We assume 
loop partitions are non-cyclic. Cyclic partitions could be handled using this 
method but for simplicity we leave them out. 

7.2.1 Graph Formulation. Our search procedure uses bipartite graphs to
represent loops and data arrays. Bipartite graphs are a popular data structure
used to represent partitioning problems for loops and data [9,22]. For a graph
G = (V_l, V_d, E), the loops are pictured as a set of nodes V_l on the left
hand side, and the data arrays as a set of nodes V_d on the right. An edge
e ∈ E between a loop node and an array node is present if and only if the loop
accesses the array. The edges are labeled by the uniformly intersecting set(s)
they represent. When we say that a data partition is induced by a loop
partition, we mean the data partition D is the same as the loop partition L's
footprint. Similarly, for loop partitions induced by data partitions.
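The graph formulation can be sketched with ordinary dictionaries. The input format, the function name and the use of the reference text as the UI-set label are assumptions made for illustration.

```python
from collections import defaultdict

def build_access_graph(loop_accesses):
    """Return (V_l, V_d, E) where E maps (loop, array) to the set of UI-set labels."""
    edges = defaultdict(set)
    for loop, refs in loop_accesses.items():
        for array, ui_label in refs:
            edges[(loop, array)].add(ui_label)
    loops = sorted(loop_accesses)
    arrays = sorted({array for (_, array) in edges})
    return loops, arrays, dict(edges)

# Two loops: X writes A[i,j]; Y reads A[j,i] and writes B[i,j].
V_l, V_d, E = build_access_graph({
    "X": [("A", "A[i,j]")],
    "Y": [("B", "B[i,j]"), ("A", "A[j,i]")],
})
```

An edge exists exactly when a loop accesses an array, and a loop that touches the same array through several UI-sets contributes several labels to one edge.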

7.2.2 Iterative Method Outline. We use an iterative local search tech- 
nique that exploits certain special properties of loops and data array parti- 
tions to move to a good solution. Extensive work evaluating search techniques 
has been done by researchers in many disciplines. Simulated annealing, gra- 
dient descent and genetic algorithms are some of these. All techniques rely 
on a cost function estimating some objective value to be optimized, and a 
search strategy. For specific problems more may be known than in the general 
case, and specific strategies may do better. In our case, we know the search 
direction that leads to improvement, and hence a specific strategy is defined. 
The algorithm greedily moves to a local minimum, does a mutation to escape 
from it, and repeats the process. 

The following is the method in more detail. To derive the initial loop 
partition, the single loop optimization method described in this paper is used. 
Then an iterative improvement method is followed, which has two phases in 
each iteration: the first (forward) phase finds the best data partitions given 
loop partitions, and the second (back) phase redetermines the values of the 
loop partitions given the data partitions just determined. 

We define a boolean value called the progress flag for each array. Specif- 
ically, in the forward phase the data partition of each array having a true 
progress flag is set to the induced data partition of the largest loop accessing 
it, among those which change the data partition. The method of controlling 
the progress flag is explained in Sect. 7.2.4. In the back phase, each loop 
partition is set to be the data partition of one of the arrays accessed by the 
loop. The cost model is used to evaluate the alternative partitions and pick 
the one with minimal cost. 

These forward and backward phases are repeated using the cost model to 
determine the estimated array reference cost for the current partitions. After 
some number of iterations, the best partition found so far is picked as the 
final partition. Termination is discussed in Sect. 7.2.4. 

7.2.3 An Example. The workings of the heuristic can be seen by a simple 
example. Consider the following code fragment: 

Doall (i=0:99, j=0:99)
    A[i,j] = i * j
EndDoall

Doall (i=0:99, j=0:99)
    B[i,j] = A[j,i]
EndDoall

The code does a transpose of A into B. The first loop is represented by 
X and the second by Y. The initial cache optimized solution for 4 processors 
is shown in Fig. 7.2. In this example, as there is no peripheral footprint for 
either array, a default load balanced solution is picked. Iterations 1 and 2 
with their forward and back phases are shown in Fig. 7.3.

Fig. 7.2. Initial solution to loop partitioning (4 processors)

Fig. 7.3. Heuristic: iterations 1 and 2 (4 processors)



In iteration 1's forward phase A and B get data partitions from their
largest accessing loops. Since both loops here are equal in size, the compiler
picks either, and one possible choice is shown by the arrows. In iteration 1's
back phase, loop Y cannot match both A's and B's data partitions, and the cost
estimator indicates that matching either has the same cost. So an arbitrary
choice as shown by the back arrows results in unchanged data partitions;
nothing has changed from the beginning.

As explained in the next section, the choice of data partitions in the
forward phase is favored in the direction of change. So in iteration 2 array A
picks a different data partition from before, that of Y instead of X. In the
back phase loop X now changes its loop partition to reduce cost as dictated by
the cost function. This is the best solution found, and no further change
occurs in subsequent iterations. In this case, this best solution is also the
optimal solution as it has 100% locality. In this example a communication-free
solution exists and was found. More generally, if one does not exist, the
heuristic will evaluate many solutions and will pick the best one it finds.
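The walk-through above can be replayed with a deliberately simplified sketch of the heuristic of Fig. 7.4. The simplifications are assumptions, not part of the original method: partitions are just the labels "rows" or "cols", a transposed access swaps the two, all loops have equal size (so the first candidate is taken instead of the largest), the cost counts mismatched accesses rather than evaluating the full cost model, and the iteration bound is passed in by hand.

```python
FLIP = {"rows": "cols", "cols": "rows"}

def induce(partition, transposed):
    """Map a strip partition across one access; A[j,i] swaps rows and columns.
    The mapping is its own inverse, so both phases can use it."""
    return FLIP[partition] if transposed else partition

def loop_cost(loop, part, data_part, accesses):
    """Mismatched accesses of one loop; each stands for |det L| remote references."""
    return sum(induce(part, t) != data_part[a]
               for (l, a, t) in accesses if l == loop)

def partition(loops, arrays, accesses, initial_loop_part, iterations):
    loop_part = dict(initial_loop_part)
    data_part = {a: None for a in arrays}
    inducing = {a: None for a in arrays}
    progress = {a: True for a in arrays}
    best_cost, best, seen = float("inf"), None, set()
    for _ in range(iterations):
        # Forward phase: prefer a loop that changes the array's data partition.
        for a in arrays:
            if progress[a]:
                for (l, arr, t) in accesses:
                    if arr == a and induce(loop_part[l], t) != data_part[a]:
                        data_part[a], inducing[a] = induce(loop_part[l], t), l
                        break
        # Back phase: each loop adopts the cheapest partition induced by one of its arrays.
        for l in loops:
            options = [(a, induce(data_part[a], t))
                       for (lp, a, t) in accesses if lp == l]
            a, part = min(options,
                          key=lambda o: loop_cost(l, o[1], data_part, accesses))
            loop_part[l] = part
            if inducing[a] != l:
                progress[a] = False       # bias rule: let the change propagate
        cost = sum(loop_cost(l, loop_part[l], data_part, accesses) for l in loops)
        if cost < best_cost:
            best_cost, best = cost, (dict(loop_part), dict(data_part))
        if cost in seen:                  # convergence or oscillation: force progress
            progress = {a: True for a in arrays}
        seen.add(cost)
    return best_cost, best

# (loop, array, transposed?) for the transpose example: X writes A[i,j], Y reads A[j,i].
accesses = [("X", "A", False), ("Y", "B", False), ("Y", "A", True)]
best_cost, (best_loops, best_data) = partition(
    ["X", "Y"], ["A", "B"], accesses, {"X": "rows", "Y": "rows"}, iterations=3)
```

In this toy model iteration 1 leaves one mismatched access, and iteration 2 moves A to Y's induced partition and then X to A's, which removes the mismatch, mirroring the behaviour described for Fig. 7.3.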

7.2.4 Some Implementation Details. The algorithm of the heuristic 
method is presented in Fig. 7.4. Some details of the algorithm are explained 
here. 

Choosing a data partition different from the current one is preferred be- 
cause it ensures movement out of local minima. A mutation out of a local 
minimum is a change to a possibly higher cost point, that sets the algorithm 
on another path. This rule ensures that the next configuration differs from 
the previous, and hence makes it unnecessary to do global checks for local 
minima. 

However, without a progress flag, always changing data partitions in the 
forward phase may change data partitions too fast. This is because one part 
of the graph may change before a change it induced in a previous iteration has 
propagated to the whole graph. This is prevented using a bias rule. A data 
partition is not changed in the forward phase if it induced a loop partition 
change in the immediately preceding back phase. This is done by setting its 
progress flag to false. 

As with all deterministic local search techniques, this algorithm could suf- 
fer from oscillations. The progress flag helps solve this problem. Oscillations 
happen when a configuration is revisited. The solution is to, conservatively, 
determine if a cost has been seen before, and if it has, simply enforce change at 
all data partition selections in the next forward phase. This sets the heuristic 
on another path. There is no need to store and compare entire configurations 
to detect oscillations. 

One issue is the number of iterations to perform. In this problem, the 
length of the longest path in the bipartite graph is a reasonable bound, since 
changed partitions in one part of the graph need to propagate to other parts 
of the graph. This bound seems to work well in practice. Further increases 
in this bound did not provide a better solution in any of the examples or 
programs tried. 
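For the small graphs that arise from individual programs, the iteration bound can be computed by exhaustive search. The sketch below is an assumption about how one might obtain it: it takes "length" to mean the number of edges on the longest simple path and uses a depth-first search whose running time is exponential in general.

```python
def longest_path_length(edges):
    """Number of edges on the longest simple path of a small undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def dfs(node, visited):
        # Extend the current path through every unvisited neighbour.
        return max((1 + dfs(n, visited | {n})
                    for n in adj[node] if n not in visited), default=0)

    return max((dfs(start, {start}) for start in adj), default=0)

# Loop X accesses array A; loop Y accesses arrays A and B.
bound = longest_path_length([("X", "A"), ("Y", "A"), ("Y", "B")])
```

For this graph the longest path is X, A, Y, B, so three iterations suffice for a change at one end to reach the other.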



8. Implementation and Results 

This paper presents cumulative footprint size measurements from an algo- 
rithm simulator and execution time measurements for loop partitioning from 
an actual compiler implementation on a multiprocessor. See [23] for results 
on combined loop and data partitioning.

Procedure Do_forward_phase()
  for all d ∈ Data_set do
    if Progress_flag[d] then
      l ← largest loop accessing d which induces changed Data_partition[d]
      Data_partition[d] ← Partition induced by Loop_partition[l]
      Origin[d] ← Access function mapping of Origin[l]
      Inducing_loop[d] ← l
    endif
  endfor
end Procedure

Procedure Do_back_phase()
  for all l ∈ Loop_set do
    d ← Array inducing Loop_partition[l] with minimum cost of accessing all
        its data
    Loop_partition[l] ← Partition induced by Data_partition[d]
    Origin[l] ← Inverse access function mapping of Origin[d]
    if Inducing_loop[d] ≠ l then
      Progress_flag[d] ← false
    endif
  endfor
end Procedure

Procedure Partition
  Loop_set : set of all loops in the program
  Data_set : set of all data arrays in the program
  Graph_G : Bipartite graph of accesses in Loop_set to Data_set
  Min_partitions ← ∅
  Min_cost ← ∞
  for all d ∈ Data_set do
    Progress_flag[d] ← true
  endfor
  for i = 1 to (length of longest path in Graph_G) do
    Do_forward_phase()
    Do_back_phase()
    Cost ← Find total cost of current partition configuration
    if Cost < Min_cost then
      Min_cost ← Cost
      Min_partitions ← Current partition configuration
    endif
    if cost repeated then /* convergence or oscillation */
      for all d ∈ Data_set do /* force progress */
        Progress_flag[d] ← true
      endfor
    endif
  endfor
end Procedure



Fig. 7.4. The heuristic algorithm 

8.1 Algorithm Simulator Experiments 

We have written a simulator of partitioning algorithms that measures the 
exact cumulative footprint size for any given hyperparallelepiped partition. 
The simulator also presents analytically computed footprint sizes using the 
formulation presented in Theorem 43. 

We present in Fig. 8.1 algorithm simulator data showing the communi- 
cation volume for array B in Example 33 (repeated below for convenience) 
resulting from a large number of loop partitions (with tile size 96) repre- 
senting both parallelograms and rectangles. The abscissa is labeled by the 
L matrix parameters of the various loop partitions, and the parallelogram 
shape is also depicted above each histogram bar. 

Doall (i=1:N, j=1:N)
    A[i,j] = B[i,j] + B[i+1,j-2] + B[i-1,j+1]
EndDoall

The example demonstrates that the analytical method yields accurate 
estimates of cumulative footprint sizes. The estimates are higher than the 
measured values when the partitions are mismatched with the offset vectors 
due to the overlapping sub-tile approximation described in Sect. 4.4. We can 
also see that the difference between the optimal parallelogram partition and 
a poor partition is significant. The differences become even greater if bigger 
offsets are used. This example also shows that rectangular partitions do not 
always yield the best partition. 
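The comparison between measured and estimated footprints can be reproduced for rectangular tiles with a small sketch. The exact count simply enumerates the elements of B touched by one tile. The estimate assumes the rectangular special case of the analytic formula (tile area plus, per dimension, the max - min spread times the length of the other side); that form is an assumption here, since the theorem itself is not restated in this section.

```python
def exact_footprint(rows, cols, offsets):
    """Count distinct array elements touched by a rows x cols tile at the origin."""
    return len({(i + di, j + dj)
                for i in range(rows) for j in range(cols)
                for (di, dj) in offsets})

def estimated_footprint(rows, cols, offsets):
    """Tile area plus one spread-wide border per dimension (assumed rectangular form)."""
    spread_i = max(o[0] for o in offsets) - min(o[0] for o in offsets)
    spread_j = max(o[1] for o in offsets) - min(o[1] for o in offsets)
    return rows * cols + spread_i * cols + spread_j * rows

# Offsets of the three references to B, on an 8 x 12 tile (96 iterations).
offsets = [(0, 0), (1, -2), (-1, 1)]
measured = exact_footprint(8, 12, offsets)
estimated = estimated_footprint(8, 12, offsets)
```

The estimate exceeds the measured value slightly because the border regions are counted as if they did not overlap at the corners.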

8.2 Experiments on the Alewife Multiprocessor 

We have also implemented some of the ideas from our framework in a com- 
piler for the Alewife machine [24] to understand the extent to which good 
loop partitioning impacts end application performance, and the extent to 
which our theory predicts the optimal loop partition. The Alewife machine 
implements a shared global address space with distributed physical memory 
and coherent caches. The nodes contain slightly modified SPARC processors 
and are configured in a 2-dimensional mesh network. 

For NUMA machines such as Alewife, where references to remote mem- 
ory are more expensive than local references, partitioning loops to increase 
cache hits is not enough. A compiler must also perform data partitioning, 
distributing data so that cache misses tend to be satisfied by local memory. 
We have implemented loop and data partitioning in our compiler using an 
iterative method as described in [15]. Because this paper focuses on loop par- 
titioning, for the following experiments we caused the compiler to distribute 
data randomly. The effect is that most cache misses are to remote memory, 
simulating a UMA machine as depicted in Fig. 2.2, and the results offer in- 
sights into the extent to which good loop partitioning affects end application 
performance. 




The performance gain due to loop partitioning depends on the ratio of 
communication to computation and other overhead. To get an understanding 
of these numbers for Alewife, we measured the performance of one loop nest 
on an Alewife simulator, and the performance of three applications on a 32- 
processor Alewife machine. 

Single Loop Nest Experiment. The following loop nest was run on a simulator 
of a 64 processor Alewife machine: 

Doall (i=0:255, j=4:251)
    A[i,j] = A[i-1,j] + B[i,j+4] + B[i,j-4]
EndDoall

The G matrix for the above loop nest is the 2 × 2 identity matrix, and
the offset vectors are a_1 = (0, 0), a_2 = (-1, 0), b_1 = (0, 4), and
b_2 = (0, -4). Each array was 512 elements (words) on a side. The cache line
size is four words, and the arrays are stored in row-major order.

Using the algorithms in this paper, and taking the four-word cache line 
size into account, the compiler chose a rectangular loop partition and deter- 
mined that the optimal partition has an aspect ratio of 2:1. The compiler 
then chose the closest aspect ratio (1:1) that also achieves load balance for 
the given problem size and machine size, which results in a tile size of 64x64 
iterations. We also ran the loop nest using suboptimal partitions with tile 
dimensions ranging from 8x512 to 512x8. This set of executions is labeled 
run A in Fig. 8.2. We ran a second version of the program using a different 
set of offset vectors that give an optimal aspect ratio of 8:1 (run B). This 
results in a desired tile size between 256x16 and 128x32 with the compiler 
choosing 256x16. 

Fig. 8.2 shows the running times for the different tile sizes, and demon- 
strates that the compiler was able to pick the optimal partitions for both 
cases. There is some noise in these figures because there can be variation in 
the cost of accessing the memory that is actually shared due to cache coher- 
ence actions, but the minima of the curves are about where the framework 
predicted. 

Application Experiments. The following three applications were run on a real 
Alewife machine with 32 processors. 

Erlebacher A code written by Thomas Eidson, from ICASE. It performs 3- 
D tridiagonal solves using Alternating Direction Implicit (ADI) 
integration. It has 40 loops and 22 arrays in one, two and three 
dimensions. 

Conduct A routine in SIMPLE, a two dimensional hydrodynamics code 
from Lawrence Livermore National Labs. It has 20 loops and 20 
arrays in one and two dimensions. 

Tomcatv A code from the SPEC suite. It has 12 loops and 7 arrays, all two 
dimensional.

Fig. 8.2. Running times in 1000's of cycles for different aspect ratios (tile
sizes) on 64 processors.



As with the loop nest example, the programs were compiled with two dif- 
ferent methods of partitioning loops. The auto method used the algorithms 
described in this paper to partition each loop independently. The other meth- 
ods assigned a fixed partition shape to each loop: rows, squares or columns. 
The results are shown in Tables 8.1, 8.2 and 8.3. The cache-miss penalty 
for this experiment is small because the Alewife remote memory access time 
is rather short (about 40 cycles). Since we expect that the importance of 
good loop partitioning will increase with the cache-miss penalty, we also ran 
two other experiments with a longer remote delay of 100 and 200 cycles. 
Alewife allows longer delays to be synthesized by a combination of software 
and hardware mechanisms. 

These results show that the choice of partitioning parameters affects per- 
formance significantly. In all cases, the partitioner was able to discover the 
best partition. In two of the applications, the compiler’s partition choice re- 
sulted in a small improvement over squares. In Tomcatv, the compiler chose 
the same square partition for each loop, resulting in no improvement over the 
fixed square partition. The performance gains over squares for all of these pro- 
grams are modest because the offsets in most of the references in the three 
applications are similar.

Table 8.1. Execution time in Mcycles for Erlebacher (N = 64)

Delay         auto    rows    squares   columns
40 cycles     27.0    27.3    28.6      28.2
100 cycles    30.4    31.4    31.2      31.3
200 cycles    34.0    35.2    36.4      36.8



Table 8.2. Execution time in Mcycles for Conduct (N = 768)

Delay         auto    rows    squares   columns
40 cycles     67.2    71.2    71.4      71.2
100 cycles    85.4    91.2    91.8      90.8
200 cycles    111.4   118.2   117.1     117.5



Table 8.3. Execution time in Mcycles for Tomcatv (N = 1200)

Delay         auto    rows    squares   columns
40 cycles     104     127     100       113
100 cycles    125     152     122       138
200 cycles    154     188     154       174



9. Conclusions 

The performance of cache-coherent systems is heavily predicated on the de- 
gree of temporal locality in the access patterns of the processor. If each block 
of data is accessed a number of times by a given processor, then caches will 
be effective in reducing network traffic. Loop partitioning for cache-coherent 
multiprocessors strives to achieve precisely this goal. 

This paper presented a theoretical framework to derive the parameters 
of iteration-space partitions of the do loops to minimize the communication 
traffic in multiprocessors with caches. The framework allows the partitioning 
of doall loops into optimal hyperparallelepiped tiles where the index expres- 
sions in array accesses can be any affine function of the indices. The same 
framework also yields optimal loop and data partitions for multicomputers 
with local memory. 

Our analysis uses the notion of uniformly intersecting references to cate- 
gorize the references within a loop into classes that will yield cache locality. A 
theory of data footprints is introduced to capture the combined set of data ac- 
cesses made by the references within each uniformly intersecting class. Then, 
an algorithm to compute precisely the total size of the data footprint for 
a given loop partition is presented. Once an expression for the total size of 
the data footprint is obtained, standard optimization techniques can be ap- 
plied to minimize the size of the data footprint and derive the optimal loop 
partitions. 

Our framework discovers optimal partitions in many more general cases 
than those handled by previous algorithms. In addition, it correctly
reproduces results from loop partitioning algorithms for certain special cases
previously proposed by other researchers.

The framework, including both loop and data partitioning for cache- 
coherent distributed shared memory, has been implemented in the compiler 
system for the Alewife multiprocessor.

A. A Formulation of Loop Tiles Using Bounding 
Hyperplanes 

A specific hyperparallelepiped loop tile is defined by a set of bounding hy- 
perplanes. Similar formulations have also been used earlier [6]. 

Definition A1. Given an l-dimensional loop nest i, each tile of a
hyperparallelepiped loop partition is defined by the hyperplanes given by the
rows of the l × l matrix H and the column vectors γ and λ as follows. The
parallel hyperplanes are h_j i = γ_j and h_j i = γ_j + λ_j, for 1 ≤ j ≤ l. An
iteration belongs to this tile if it is on or inside the hyperparallelepiped.

When loop tiles are assumed to be homogeneous except at the boundaries 
of the iteration space, the partitioning is completely defined by specifying the 
tile at the origin, namely (H, 0, λ), as indicated in Fig. A.1. For notational 
convenience, we denote the tile at the origin as L. 

Definition A2. Given the tile (H, 0, λ) at the origin of a hyperparallelepiped
partition, let L = L(H) = Λ (H^{-1})^T, where Λ is a diagonal matrix with
Λ_ii = λ_i. We refer to the tile by the L matrix, as L completely defines the
tile at the origin. The rows of L specify the vertices of the tile at the
origin.
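The relation between H, λ and L can be checked numerically. The sketch below assumes the reading L = Λ (H^{-1})^T, restricts itself to two dimensions, and uses illustrative values for H and λ. It verifies that row k of L lies on the hyperplane h_k i = λ_k and on h_j i = 0 for every j ≠ k, which is what makes the rows of L the tile's vertices adjacent to the origin.

```python
from fractions import Fraction

def tile_matrix(H, lam):
    """L = Lambda * (H^{-1})^T for a 2 x 2 matrix H and widths lam (assumed formula)."""
    (a, b), (c, d) = H
    det = Fraction(a * d - b * c)
    H_inv = [[d / det, -b / det], [-c / det, a / det]]
    H_inv_T = [[H_inv[0][0], H_inv[1][0]], [H_inv[0][1], H_inv[1][1]]]
    return [[lam[k] * x for x in H_inv_T[k]] for k in range(2)]

def on_expected_hyperplanes(H, lam, L):
    """Check h_j . L_k == lam_k when j == k, and 0 otherwise."""
    return all(sum(H[j][m] * L[k][m] for m in range(2)) == (lam[k] if j == k else 0)
               for j in range(2) for k in range(2))

H = [[1, 0], [-1, 1]]     # illustrative bounding hyperplane normals (a skewed tile)
lam = (4, 6)              # illustrative hyperplane separations
L = tile_matrix(H, lam)
```

For these values the tile at the origin is the parallelogram spanned by (4, 4) and (0, 6).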




Fig. A.1. Iteration space partitioning is completely specified by the tile at
the origin.



B. Synchronization References 

Sequential do loops can often be converted to parallel do loops by
introducing fine-grain data-level synchronization to enforce data dependencies
or mutual exclusion. The cost of synchronization can be approximately modeled
as slightly more expensive communication [14]. For example, in the Alewife
system the inner loop of matrix multiply can be written using fine-grain
synchronization in the form of the loop in Example B.

Doall (i=1:N, j=1:N, k=1:N)
    l$C[i,j] = l$C[i,j] + A[i,k] + B[k,j]
EndDoall

In the code segment in Example B, the “l$” preceding the C matrix
references denotes atomic accumulates. Accumulates into the C array can
happen in any order, provided that each accumulate action is atomic. Such
synchronizing reads or writes are both treated as writes by the coherence
system. Similar linguistic constructs are also present in Id [25] and in a variant
of FORTRAN used on the HEP [26].

Summary. This chapter introduces several communication-free partitioning
techniques for nested loops from the literature. Since the cost of data
communication is much higher than that of a primitive computation in
distributed-memory multicomputers, it is important to reduce data
communication. The ideal situation is to eliminate data communication entirely
whenever possible. During the last few years, many techniques that achieve
communication-free execution by partitioning nested loops have been proposed.
This chapter surveys these techniques and points out the differences among
them.



1. Introduction 

In distributed-memory multicomputers, interprocessor communication is a
critical factor affecting overall system performance. Excessive interprocessor
communication offsets the benefit of parallelization even if the program has
a large amount of parallelism. A modern parallelizing compiler must therefore
take into consideration not only parallelism but also the amount of
communication. Traditionally, parallelizing compilers have considered only how
to restructure programs to exploit as much parallelism as possible
[3,21,23,24,29,33-35,37]. However, exploiting parallelism may incur severe
interprocessor communication overhead and degrade system performance. Fully
exploiting the parallelism within a program does not by itself guarantee the
highest performance. Since there is a tradeoff between increasing parallelism
and decreasing interprocessor communication, balancing these two factors to
achieve high performance has become a crucial task for current parallelizing
compilers.

During the last few years, many researchers have become aware of this fact
and have begun to develop compilation techniques that take both parallelism
and communication into consideration. Because the cost of data communication
in distributed-memory multicomputers is much higher than that of a primitive
computation, the major consideration for distributed-memory multicomputers
has shifted to distributing data to appropriate locations in order to reduce
communication overhead. Thus, in previous work, a number of researchers have
developed parallelizing compilers that require programmers to specify the data
distribution. Based on the programmer-specified data distribution, these
compilers can automatically generate the parallel program with appropriate
message-passing constructs for multicomputers. The Fortran D compiler project
[13,14,31], the SUPERB project [38], the Kali project [19,20], the DINO
project [28], and the Id Nouveau compiler [27] are all based on this idea.
The generated parallel program mostly follows the SPMD (Single Program
Multiple Data) [18] model.

Recently, automatic data partitioning has become a research topic of great
interest in the field of parallelizing compilers. Many researchers have
developed systems that automatically determine the data distribution at
compile time. The PARADIGM project [11] and the SUIF project [2,33] both
serve this purpose. These systems automatically determine appropriate data
distribution patterns to minimize the communication overhead and generate
SPMD code with appropriate message-passing constructs for distributed-memory
multicomputers.

The ideal way to reduce interprocessor communication is to eliminate it
completely, if possible. Communication-free partitioning is therefore an
interesting issue that is worth studying for distributed-memory
multicomputers. In recent years, many researchers have paid attention to
automatically partitioning data and/or computation among processors so as to
completely eliminate interprocessor communication. Based on hyperplane
partitions of data spaces, the problem of communication-free partitioning is
formulated in terms of matrix representation, and the existence of
communication-free partitions of data arrays is derived, in [26]. An approach
that uses affine processor mappings for statements to distribute computation
to processors without communication while maximizing the degree of
parallelism is presented in [22]. Two communication-free partitioning
strategies, non-duplicate data and duplicate data partitioning, for a
perfectly nested loop with uniformly generated references are developed in
[5]. The necessary and sufficient conditions for the feasibility of
communication-free hyperplane partitioning of iteration and data spaces at
the loop level and at the statement level are provided in [16] and [30],
respectively. All of these methods handle affine array references directly.
None of them uses information about data dependence distances or direction
vectors.

This chapter surveys communication-free partitioning of nested loops. We
roughly classify communication-free partitioning as loop-level partitioning
or statement-level partitioning, according to the level at which the
partition is performed. The following methods will be presented in this
chapter:

— Loop-Level Partitioning 

— Chen and Sheu’s method [5], which proposed two communication-free 
data allocation strategies for a perfectly nested loop with uniformly
generated references: one disallows duplicating data on processors and the
other allows it.

— Ramanujam and Sadayappan’s method [26], which proposed communi- 
cation-free hyperplane partitioning for data spaces. The computation 
partitioning is achieved by the "owner-computes" rule.

— Huang and Sadayappan’s method [16], which proposed the sufficient and 
necessary conditions for the feasibility of communication-free partitioning
of iteration spaces and data spaces.

— Statement-Level Partitioning 

— Lim and Lam’s method [22], which proposed affine processor mappings 
for communication-free partitioning of statement-iteration spaces that also
maximize the degree of parallelism.

— Shih, Sheu, and Huang’s method [30], which proposed the sufficient and
necessary conditions for communication-free partitioning of
statement-iteration spaces and data spaces.

We briefly describe each of these methods and point out the differences
among them. The rest of the chapter is organized as follows. Since the
analysis of array references is very important for communication-free
partitioning, Section 2 analyzes the fundamentals of array references and
some important properties derived from them. Section 3 considers loop-level
communication-free partitioning. Section 3.1 presents the non-duplicate data
and duplicate data strategies proposed in [5]. Section 3.2 describes the
communication-free data partitioning along hyperplanes proposed in [26].
Section 3.3 presents the sufficient and necessary conditions for
communication-free iteration and data space partitioning along hyperplanes
derived in [16]. Section 4 discusses statement-level communication-free
partitioning. Section 4.1 introduces the affine processor mappings for
statement-iteration spaces proposed in [22], which achieve
communication-free partitioning and obtain the maximum degree of
communication-free parallelism. Section 4.2 derives the sufficient and
necessary conditions for communication-free statement-iteration and data
space partitioning along hyperplanes proposed in [30]. Section 5 compares
these methods and indicates the differences among them. Finally, concluding
remarks are provided in Section 6.



2. Fundamentals of Array References 

Communication-free partitioning considers the partitioning of iteration
and/or data spaces. Data elements in data spaces are referenced by different
iterations in iteration spaces via array references. Therefore, the analysis
of array references is indispensable for compilation techniques in
parallelizing compilers, including communication-free partitioning. In this
section we formulate array references and present some useful properties that
will be used frequently in this chapter. We use simple examples to
demonstrate the notation and concepts of array references. The formal
definitions and general extensions can be obtained accordingly.

2.1 Iteration Spaces and Data Spaces 

Let $Z$ denote the set of integers. The symbol $Z^d$ represents the set of
d-tuples of integers. Generally speaking, a d-nested loop forms a $Z^d$
integer space. An instance of each loop index variable corresponds to a value
in its corresponding dimension. The loop bounds of each dimension set the
bounds of the integer space. This bounded integer space is called the
iteration space of the d-nested loop. We denote the iteration space of a
nested loop $L$ as $IS(L)$. An iteration is an integer point in the iteration
space and includes all the executions of statements enclosed in the nested
loop. A similar concept can be used for data spaces. Basically, a data space
is also an integer space. An n-dimensional array forms an n-dimensional
integer space. The data space of array $v$ is denoted as $DS(v)$. A data
index indicates the data element with that index.

Example 2.1. Consider the following program.

do i1 = 1, N
  do i2 = 1, N
    A(i1 - i2, -i1 + i2) = A(i1 - i2 - 1, -i1 + i2 + 1)    (L2.1)
  enddo
enddo

L2.1 is a 2-nested loop. The iteration space of L2.1 is the $Z^2$ integer
space and each dimension is bounded by 1 to $N$. We can formulate the
iteration space $IS(L_{2.1})$ in set representation as
$IS(L_{2.1}) = \{[i_1, i_2]^t \mid 1 \le i_1, i_2 \le N\}$. The superscript
$t$ is a transpose operator. Fig. 2.1 shows an example of the iteration space
of loop L2.1, assuming $N = 5$. The column vector $I = [i_1, i_2]^t$, for
some instance of $i_1$ and $i_2$, is called an iteration in iteration space
$IS(L_{2.1})$, where $1 \le i_1, i_2 \le N$. Since array $A$ is
2-dimensional, $DS(A)$ is the two-dimensional integer space. Any integer
point indexed as $D = [D_1, D_2]^t$ in $DS(A)$ represents an array element
$A(D_1, D_2)$.

Fig. 2.1. Iteration space of loop L2.1, $IS(L_{2.1})$, assuming $N = 5$.



2.2 Reference Functions 



Iteration spaces establish relations with data spaces via array references. 
Since the references of most programs are affine functions, we will consider 
only affine references. An array reference can be viewed as a reference function 
from the iteration space to the data space of that array. Each iteration is 
mapped to a data index in data space by means of a specified array reference. 
For example, reconsider L2.1. The iteration $I = [i_1, i_2]^t$ is mapped to
the array index $D = [i_1 - i_2 - 1, -i_1 + i_2 + 1]^t$ by means of the array
reference $A(i_1 - i_2 - 1, -i_1 + i_2 + 1)$. Let the reference function be
denoted as $Ref$. The reference function $Ref$ can be written as
$Ref(I) = D$. In other words,

\[
Ref\left(\begin{bmatrix} i_1 \\ i_2 \end{bmatrix}\right)
  = \begin{bmatrix} i_1 - i_2 - 1 \\ -i_1 + i_2 + 1 \end{bmatrix}.
\]

We can represent the reference function in matrix form by separating
coefficients, loop index variables and constant terms. For example,

\[
Ref\left(\begin{bmatrix} i_1 \\ i_2 \end{bmatrix}\right)
  = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}
    \begin{bmatrix} i_1 \\ i_2 \end{bmatrix}
  + \begin{bmatrix} -1 \\ 1 \end{bmatrix}.
\]

The coefficient matrix and constant vector are termed reference matrix and
reference vector, respectively. Let the reference matrix and reference vector
be respectively denoted as $R$ and $r$. The reference function can,
therefore, be written as $Ref(I) = R \cdot I + r = D$. For another array
reference $A(i_1 - i_2, -i_1 + i_2)$ in loop L2.1, the reference matrix and
reference vector are as follows.

\[
R = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},
\quad \text{and} \quad
r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]

Once the array reference has been formulated, the reference function re- 
veals a number of interesting and important properties. We present these 
properties in the following section. 
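
The matrix form of a reference function can be tried out directly. The
following is a minimal sketch, not code from any of the surveyed papers; the
helper name `make_ref` and the sample iteration are assumptions made for
illustration only. It builds the two reference functions of loop L2.1 from
their reference matrix and reference vectors and evaluates them at one
iteration.

```python
# Minimal sketch of Ref(I) = R*I + r for loop L2.1.
# make_ref is a hypothetical helper name, not part of any surveyed method.

def make_ref(R, r):
    """Return the affine reference function I -> R*I + r over the integers."""
    def ref(I):
        return tuple(
            sum(R[row][k] * I[k] for k in range(len(I))) + r[row]
            for row in range(len(R))
        )
    return ref

R = [[1, -1], [-1, 1]]             # reference matrix shared by both references
ref_write = make_ref(R, [0, 0])    # A(i1 - i2, -i1 + i2)
ref_read = make_ref(R, [-1, 1])    # A(i1 - i2 - 1, -i1 + i2 + 1)

print(ref_write((3, 1)))  # (2, -2)
print(ref_read((3, 1)))   # (1, -1)
```

Keeping $R$ and $r$ as plain data rather than hard-coding the index
expressions makes it easy to reuse the same helper for other affine
references.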



2.3 Properties of Reference Functions 

Communication-free partitioning requires that data be allocated to the pro- 
cessor which uses them. The requirement is applied to both read and written 
data elements. Therefore, we are interested in how data elements are accessed 
by iterations. Based on the above representation, we consider two situations: 
(1) two iterations referencing the same data element and (2) two data ele- 
ments referenced by the same iteration. 
For the same array variable, consider two array references
$Ref(I) = R \cdot I + r$ and $Ref'(I) = R' \cdot I + r'$. Suppose $I_1$ and
$I_2$ are two different iterations. Let $Ref(I_1) = D_1$, $Ref(I_2) = D_2$,
$Ref'(I_1) = D_1'$, and $Ref'(I_2) = D_2'$. Any two of the four data elements
can be equal; hence, there are in total $C^4_2 = 6$ different cases:
$D_1 = D_2$, $D_1' = D_2'$, $D_1 = D_2'$, $D_1' = D_2$, $D_1 = D_1'$, and
$D_2 = D_2'$. However, some are synonymous. $D_1 = D_2$ and $D_1' = D_2'$ are
the case that, for the same array reference, two different iterations
reference the same data element. We denote this situation as the
self-dependent relation. $D_1 = D_2'$ and $D_1' = D_2$ are the case that two
different iterations reference the same data element via two different array
references. This condition is termed the cross-dependent relation. Note that
$D_1 = D_1'$ and $D_2 = D_2'$ are also synonymous, since both consider the
case that one iteration references the same data element via different array
references. However, we will not consider this case because the two
references $Ref$ and $Ref'$ are identical in the case of affine reference
functions. Hence, the following two relations need to be characterized for
all kinds of compilation techniques:

— Self-Dependent Relation. For the same array reference, under what
conditions two different iterations reference the same data element.

— Cross-Dependent Relation. For two different array references, under what
conditions two different iterations reference the same data element.

Self-Dependent Relation. A reference function may or may not be invertible.
For some array references, it is possible, and it often happens, that two
iterations reference the same data element. This implies that the reference
function is not one-to-one and hence not invertible. In other words, these
two iterations are data dependent.

For some reference function $Ref(I) = R \cdot I + r$, suppose $I$ and $I'$
are two different iterations and $D$ and $D'$ are referenced respectively by
$I$ and $I'$, i.e., $Ref(I) = D$ and $Ref(I') = D'$. We are interested in the
conditions under which two different iterations reference the same data
element, i.e., $D' = D$. Since $D' = D$, we have

\[
\begin{aligned}
                 & Ref(I') = Ref(I) \\
\Leftrightarrow\ & R \cdot I' + r = R \cdot I + r \\
\Leftrightarrow\ & R \cdot (I' - I) = 0 \\
\Leftrightarrow\ & (I' - I) \in NS(R),
\end{aligned}
\]

where $NS(R)$ is the null space of $R$. This implies that if two iterations
reference the same data element, then the difference of the two iterations
must belong to the null space of $R$. In fact, this condition is both
sufficient and necessary: two iterations reference the same data element
whenever the difference of the two iterations belongs to the null space of
$R$. We summarize the above in the following lemma.

Lemma 2.1. For some reference function $Ref(I) = R \cdot I + r$, let $I$ and
$I'$ be two different iterations and let $D$ and $D'$ be referenced
respectively by $I$ and $I'$, i.e., $Ref(I) = D$ and $Ref(I') = D'$. Then

\[
Ref(I') = Ref(I) \iff (I' - I) \in NS(R). \qquad (2.1)
\]

Lemma 2.1 also implies that if $NS(R) = \{\mathbf{0}\}$, the reference
function $Ref(I) = R \cdot I + r$ has no self-dependent relation.

Example 2.2. Take the array reference $A(i_1 - i_2, -i_1 + i_2)$ in loop
L2.1 as an example. Iterations $[1,1]^t$, $[2,2]^t$, $[3,3]^t$, $[4,4]^t$,
$[5,5]^t$ all reference the same data element $A(0,0)$. Clearly, the
difference of any two of these iterations belongs to
$\{c[1,1]^t \mid c \in Z\}$. The reference matrix of
$A(i_1 - i_2, -i_1 + i_2)$ is
$R = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$. The null space of $R$
is $NS(R) = \{c[1,1]^t \mid c \in Z\}$. This fact matches the result obtained
from Lemma 2.1. The above result is also illustrated in Fig. 2.2.







Fig. 2.2. Illustration of self-dependent relation. 
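
The self-dependent relation of Example 2.2 can be checked by brute force.
The sketch below is an illustration under stated assumptions, not part of
the surveyed methods: the bound `N = 5` mirrors the figure, and the helper
names are hypothetical. It groups the iterations of L2.1 by the element they
reference through $A(i_1 - i_2, -i_1 + i_2)$ and confirms that any two
iterations in the same group differ by a multiple of $[1,1]^t$.

```python
from collections import defaultdict
from itertools import combinations

N = 5  # assumed loop bound, as in Fig. 2.1

def ref(i1, i2):
    # A(i1 - i2, -i1 + i2): reference matrix [[1, -1], [-1, 1]], r = [0, 0]
    return (i1 - i2, -i1 + i2)

# Group iterations by the data element they reference.
groups = defaultdict(list)
for i1 in range(1, N + 1):
    for i2 in range(1, N + 1):
        groups[ref(i1, i2)].append((i1, i2))

def in_null_space(d):
    # d lies in NS(R) = {c*[1,1]^t} exactly when its two components are equal.
    return d[0] == d[1]

all_in_ns = all(
    in_null_space((b[0] - a[0], b[1] - a[1]))
    for iterations in groups.values()
    for a, b in combinations(iterations, 2)
)

print(groups[(0, 0)])  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
print(all_in_ns)       # True
```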



Cross-Dependent Relation. If there exist two different array references, the
data dependence relations between these two array references can be regarded
as cross data dependence relations. The cross data dependence relations can
also affect the placement of data elements if communication-free execution
is required. Therefore, the cross data dependence relations also need to be
identified.

For the same array variable, consider two array references
$Ref(I) = R \cdot I + r$ and $Ref'(I) = R' \cdot I + r'$. Suppose $I$ and
$I'$ are two different iterations. Without loss of generality, assume
$Ref(I) = R \cdot I + r = D$ and $Ref'(I') = R' \cdot I' + r' = D'$. The
cross data dependence relations happen when $D' = D$, that is,
$Ref'(I') = Ref(I)$. It implies

\[
\begin{aligned}
                 & Ref'(I') = Ref(I) \\
\Leftrightarrow\ & R' \cdot I' + r' = R \cdot I + r \\
\Leftrightarrow\ & R' \cdot I' - R \cdot I = r - r'.
\end{aligned}
\]

In other words, any pair of $I$ and $I'$ that satisfies the above equation
is data dependent. The above results are summarized in the following lemma.
Lemma 2.2. For two different reference functions $Ref(I) = R \cdot I + r$
and $Ref'(I) = R' \cdot I + r'$, let $I$ and $I'$ be two different
iterations and let $D$ and $D'$ be referenced respectively by $I$ and $I'$,
i.e., $Ref(I) = D$ and $Ref'(I') = D'$. Then

\[
Ref'(I') = Ref(I) \iff R' \cdot I' - R \cdot I = r - r'. \qquad (2.2)
\]

Lemma 2.1 is the special case of Lemma 2.2 with $R' = R$ and $r' = r$.
Another interesting special case, more general than the self-dependent
relation, is when $R' = R$ but $r' \neq r$. If $R' = R$ and $r' \neq r$,
Eq. (2.2) becomes

\[
R \cdot (I' - I) = r - r'. \qquad (2.3)
\]

Solving Eq. (2.3) gives the following results. Suppose $I_p$ is a particular
solution of Eq. (2.3), i.e., $I_p$ satisfies $R \cdot I_p = r - r'$. The
general solution of Eq. (2.3) is $I_g = I_p + NS(R)$. Consequently,
$(I' - I) \in I_p + NS(R)$. This implies that, given an iteration $I$, every
iteration $I'$ such that $(I' - I) \in I_p + NS(R)$ is data dependent on
$I$, where $I_p$ is a particular solution of $R \cdot I_p = r - r'$. It is
possible that there is no particular solution at all. In that case there
exists no cross-dependent relation between the reference functions
$R \cdot I + r$ and $R \cdot I + r'$. We summarize the above results in the
following lemma.

Lemma 2.3. For two different reference functions with the same reference
matrix $R$, $Ref(I) = R \cdot I + r$ and $Ref'(I) = R \cdot I + r'$, let $I$
and $I'$ be two different iterations and let $D$ and $D'$ be referenced
respectively by $I$ and $I'$, i.e., $Ref(I) = D$ and $Ref'(I') = D'$. If
there exists a particular solution $I_p$ that satisfies Eq. (2.3), then

\[
Ref'(I') = Ref(I) \iff (I' - I) \in I_p + NS(R). \qquad (2.4)
\]







Fig. 2.3. Illustration of cross-dependent relation. 
Example 2.3. Reconsider loop L2.1. There exist two different array
references. One is $A(i_1 - i_2, -i_1 + i_2)$ and the other is
$A(i_1 - i_2 - 1, -i_1 + i_2 + 1)$. The reference functions of these two
array references have the same reference matrix
$R = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$. The reference vector
of the former is $r = [0, 0]^t$ and that of the latter is $r' = [-1, 1]^t$.
The null space of $R$ is $NS(R) = \{c[1,1]^t \mid c \in Z\}$. A particular
solution of Eq. (2.3) is $I_p = [1, 0]^t$. By Lemma 2.3, for two array
references with the same reference matrix, given an iteration $I$, the data
element referenced by $I$ via one array reference will be referenced by $I'$
via the other array reference, where $I'$ satisfies Eq. (2.3). Fig. 2.3
illustrates the data dependence relations in $IS(L_{2.1})$. Note that, since
the data dependence relation is transitive, the data dependence relations
that can be obtained by transitivity are omitted in this figure.
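
Lemma 2.3 can likewise be verified numerically for Example 2.3. The
following sketch is only an illustration: the bound `N = 5` and all function
names are assumptions. It enumerates every pair of iterations $(I, I')$ of
L2.1 with $Ref'(I') = Ref(I)$ and compares that set with the pairs for which
$I' - I - I_p$ lies in $NS(R)$, where $I_p = [1, 0]^t$.

```python
N = 5          # assumed loop bound, as in Fig. 2.3
I_p = (1, 0)   # particular solution of R*(I' - I) = r - r' from Example 2.3

def ref(I):
    # A(i1 - i2, -i1 + i2), reference vector r = [0, 0]
    return (I[0] - I[1], -I[0] + I[1])

def ref_prime(I):
    # A(i1 - i2 - 1, -i1 + i2 + 1), reference vector r' = [-1, 1]
    return (I[0] - I[1] - 1, -I[0] + I[1] + 1)

space = [(i1, i2) for i1 in range(1, N + 1) for i2 in range(1, N + 1)]

# Pairs found by comparing referenced data elements directly.
pairs_by_data = {(I, J) for I in space for J in space if ref_prime(J) == ref(I)}

def offset_in_coset(I, J):
    # True when (J - I) - I_p is a multiple of [1, 1]^t, i.e. lies in NS(R).
    d = (J[0] - I[0] - I_p[0], J[1] - I[1] - I_p[1])
    return d[0] == d[1]

# Pairs predicted by Lemma 2.3.
pairs_by_lemma = {(I, J) for I in space for J in space if offset_in_coset(I, J)}

print(pairs_by_data == pairs_by_lemma)      # True
print(((1, 1), (2, 1)) in pairs_by_data)    # True
```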

Iteration-Dependent Space. An iteration-dependent space is a subspace of the
iteration space in which the iterations are directly or indirectly
dependent. Trivially, an iteration-dependent space contains the spaces
obtained by self-dependent relations and cross-dependent relations. Clearly,
all iterations within an iteration-dependent space should be executed by one
processor if communication-free execution is required. If the
iteration-dependent space spans the whole iteration space, all iterations in
the iteration space have to be executed by one processor. This implies that
the nested loop has no communication-free partition if parallel processing
is required; that is, the execution of the nested loop does involve
interprocessor communication on multicomputers. Let the iteration-dependent
space be denoted as $IDS(L)$, where $L$ is a nested loop. The
iteration-dependent space of L2.1, $IDS(L_{2.1})$, is
$span(\{[1,1]^t, [1,0]^t\})$, where $span(S)$ is the set of all linear
combinations of vectors in set $S$ [15]. It follows that $IDS(L_{2.1})$
spans $Z^2$ and therefore the whole iteration space of L2.1. In other words,
all iterations in the iteration space $IS(L_{2.1})$ should be executed by
one processor, so loop L2.1 should be executed sequentially if
communication-free execution has to be satisfied. Fig. 2.3 also shows the
iteration-dependent space of L2.1.
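
One way to see the effect of the iteration-dependent space is to group
iterations that share a data element, directly or transitively. The sketch
below uses a union-find structure for this; it is an illustrative technique
chosen here, not the algebraic procedure of the surveyed papers, and the
names `dependent_groups` and `refs_L21` as well as `N = 5` are assumptions.
For L2.1 all iterations fall into a single group, whereas the first
reference alone would leave one group per diagonal.

```python
def dependent_groups(iterations, refs):
    """Group iterations that touch a common data element, transitively."""
    parent = {I: I for I in iterations}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_user = {}  # data element -> first iteration seen touching it
    for I in iterations:
        for ref in refs:
            D = ref(I)
            if D in first_user:
                parent[find(I)] = find(first_user[D])
            else:
                first_user[D] = I

    groups = {}
    for I in iterations:
        groups.setdefault(find(I), []).append(I)
    return list(groups.values())

N = 5  # assumed loop bound
space = [(i1, i2) for i1 in range(1, N + 1) for i2 in range(1, N + 1)]
refs_L21 = [
    lambda I: (I[0] - I[1], -I[0] + I[1]),          # A(i1 - i2, -i1 + i2)
    lambda I: (I[0] - I[1] - 1, -I[0] + I[1] + 1),  # A(i1 - i2 - 1, -i1 + i2 + 1)
]

groups = dependent_groups(space, refs_L21)
only_first_ref = dependent_groups(space, refs_L21[:1])

print(len(groups))          # 1: L2.1 has no communication-free parallel partition
print(len(only_first_ref))  # 9: one group per diagonal i1 - i2 = const
```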



3. Loop-Level Partitioning 

3.1 Iteration and Data Spaces Partitioning — Uniformly 
Generated References 

In this section, we present the method proposed in [5], which we refer to as
Chen and Sheu’s method. On the premise that no interprocessor communication
is allowed, they find iteration space and data space partitions that exploit
the maximum degree of parallelism. Since the cost of an interprocessor
communication in distributed-memory multicomputers is much higher than that
of a primitive computation, they prefer to lose some parallelism in order
not to incur any interprocessor communication, rather than to exploit the
maximum amount of parallelism at the price of severe interprocessor
communication. Chen and Sheu’s method considers a special loop model, a
perfectly nested loop with uniformly generated references. This loop model
has properties that make communication-free partitioning easy to analyze.
Based on these properties, they presented two loop-level communication-free
partitioning strategies: the non-duplicate data strategy and the duplicate
data strategy. In addition, they derived sufficient conditions for
communication-free partitioning under both the non-duplicate data and the
duplicate data strategies.

3.1.1 Non-duplicate Data Strategy. Non-duplicate data means that 
each data element is distributed to exactly one processor. 

Example 3.1. Consider the following loop.

do i1 = 1, N
  do i2 = 1, N
    A(i1 - i2, -i1 + i2) = B(i1, i2) + C(i1, i2)                 (L3.1)
    B(i1, i2) = A(i1 - i2, -i1 + i2) + C(i1 - 1, i2 - 1)
  enddo
enddo

The loop model considered by Chen and Sheu’s method is a perfectly nested
loop with uniformly generated references. A nested loop has uniformly
generated references if all the reference matrices of the same array
variable are of the same form. Obviously, loop L3.1 is a uniformly generated
reference loop. Since iterations in an iteration-dependent space are data
dependent, each iteration-dependent space should be assigned to a single
processor if program execution is not to involve any interprocessor
communication.

The iteration-dependent space contains the self-dependent relations and
cross-dependent relations introduced in Section 2.3. Self-dependent
relations can be obtained by Lemma 2.1. The null space of $R^A$ is
$NS(R^A) = \{c[1,1]^t \mid c \in Z\}$. According to Lemma 2.1, any two
iterations $I$ and $I'$ such that $(I' - I) \in NS(R^A)$ are data dependent.
Therefore, the space obtained by the self-dependent relation resulting from
array $A$ is $span(\{[1,1]^t\})$. Similarly, for the array variables $B$ and
$C$, the null spaces of $R^B$ and $R^C$ are the same and equal
$NS(R^B) = NS(R^C) = \{\mathbf{0}\}$. This implies that no two iterations
reference the same data element of array $B$ via the same array reference,
and the same holds for array $C$. The self-dependent relation of loop L3.1
should include all the self-dependent relations resulting from the different
array variables. As a result, the self-dependent relation of loop L3.1 is
the space spanned by $NS(R^A)$, $NS(R^B)$ and $NS(R^C)$. The space obtained
via self-dependent relations of loop L3.1 is $span(\{[1,1]^t\})$.
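
The null spaces used above are small enough to inspect by exhaustive search.
The following sketch assumes reference matrices
$R^A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$ and
$R^B = R^C$ equal to the identity, as read off from loop L3.1; the function
names and the search bound are hypothetical. It lists the nonzero integer
vectors with small entries that each matrix maps to zero. A search of this
kind only illustrates the result and is not a general null-space algorithm.

```python
from itertools import product

def apply(R, v):
    """Multiply integer matrix R (list of rows) by vector v."""
    return tuple(sum(R[i][k] * v[k] for k in range(len(v))) for i in range(len(R)))

def small_null_vectors(R, bound=2):
    """Nonzero integer vectors with entries in [-bound, bound] that R maps to 0."""
    n = len(R[0])
    return [
        v for v in product(range(-bound, bound + 1), repeat=n)
        if any(v) and not any(apply(R, v))
    ]

R_A = [[1, -1], [-1, 1]]   # A(i1 - i2, -i1 + i2)
R_B = [[1, 0], [0, 1]]     # B(i1, i2)
R_C = [[1, 0], [0, 1]]     # C(i1, i2) and C(i1 - 1, i2 - 1)

print(small_null_vectors(R_A))  # [(-2, -2), (-1, -1), (1, 1), (2, 2)]
print(small_null_vectors(R_B))  # []
print(small_null_vectors(R_C))  # []
```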

Cross-dependent relations consider the same array variable but different
array references. Any two array references of the same array variable are
considered. Since the nested loop considered is a perfectly nested loop with
uniformly generated references, the reference matrices of the same array
variable are of the same form. Therefore, the cross-dependent relations can
be obtained by Lemma 2.3. For array variable $C$, the reference vectors of
the two array references are $r = [0, 0]^t$ and $r' = [-1, -1]^t$,
respectively. We have to solve Eq. (2.3), that is,

\[
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} (I' - I)
  = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.
\]

Obviously, there exists a particular solution $I_p = [1, 1]^t$. Since the
null space of $R^C$ is $NS(R^C) = \{\mathbf{0}\}$, by Lemma 2.3, any two
different iterations $I$ and $I'$ such that $I' - I = I_p$ are data
dependent. Therefore, the space obtained by the cross-dependent relation
resulting from array variable $C$ is $span(\{[1,1]^t\})$.

For array variable $A$, although there are two array references, the two
array references are actually the same. The equation to be solved is
$R^A \cdot (I' - I) = 0$, which reduces to the self-dependent case. Hence,
whenever $(I' - I) \in NS(R^A)$, the two iterations reference the same data
element by means of the two array references. Thus the space obtained by the
cross-dependent relation resulting from array variable $A$ is
$NS(R^A) = \{c[1,1]^t \mid c \in Z\}$. Array variable $B$ is in the same
situation as array variable $A$. Therefore, the space obtained by the
cross-dependent relation resulting from array variable $B$ is
$NS(R^B) = \{\mathbf{0}\}$. As with the self-dependent relation, the
cross-dependent relation of the loop has to include all cross-dependent
relations obtained from the different array variables. Combining the above
analyses, the space obtained by cross-dependent relations for loop L3.1 is
$span(\{[1,1]^t\})$.

As described in Section 2.3, the iteration-dependent space includes the
self-dependent relations and cross-dependent relations. Since the spaces
obtained by self-dependent relations and cross-dependent relations are the
same, $span(\{[1,1]^t\})$, the iteration-dependent space of loop L3.1,
$IDS(L_{3.1})$, is therefore equal to $span(\{[1,1]^t\})$. We have found the
basis of the iteration-dependent space, which is $\{[1,1]^t\}$.

Based on the iteration-dependent space, the data space partitions can be
obtained accordingly. Basically, all data elements referenced by iterations
in the iteration-dependent space are grouped together and distributed to the
processor to which the iteration-dependent space is distributed. Suppose
that there are $k$ different reference functions $Ref_i$, $1 \le i \le k$,
that reference the same array variable in loop $L$ and $IDS(L)$ is the
iteration-dependent space of loop $L$. The following iteration set and data
set are allocated to the same processor:

\[
\{ I \mid \forall I \in IDS(L) \}, \quad \text{and} \quad
\{ D \mid D = Ref_i(I),\ \forall I \in IDS(L) \text{ and } i = 1, 2, \ldots, k \}.
\]

The above constraint should hold true for all array variables in loop $L$.

Suppose $\Psi(L)$ is the iteration space partitioning and $\Phi(L)$ is the
data space partitioning. Let $Ref^v_k$ denote the $k$-th array reference of
array variable $v$ in loop $L$ and $D^v$ denote the data index in the data
space of $v$. For this example, given $I_{init}$, an initial iteration, the
following sets are distributed to the same processor.

\[
\begin{aligned}
\Psi(L_{3.1}) = {} & \{ I \mid I - I_{init} \in IDS(L_{3.1}) \}, \\
\Phi(L_{3.1}) = {} & \{ D^A \mid D^A = Ref^A_1(I) \text{ or } D^A = Ref^A_2(I),\ \forall I \in \Psi(L_{3.1}) \} \\
  \cup {} & \{ D^B \mid D^B = Ref^B_1(I) \text{ or } D^B = Ref^B_2(I),\ \forall I \in \Psi(L_{3.1}) \} \\
  \cup {} & \{ D^C \mid D^C = Ref^C_1(I) \text{ or } D^C = Ref^C_2(I),\ \forall I \in \Psi(L_{3.1}) \}.
\end{aligned}
\]

Fig. 3.1 illustrates the non-duplicate data communication-free iteration and
data allocations for loop L3.1 when $I_{init} = [1,1]^t$. Fig. 3.1 (a) is
the iteration space partitioning $\Psi(L_{3.1})$ on the iteration space of
loop L3.1, $IS(L_{3.1})$, where $I_{init} = [1,1]^t$. Fig. 3.1 (b), (c), and
(d) are the data space partitionings $\Phi(L_{3.1})$ on the data spaces of
$A$, $B$, and $C$, respectively. If $\Psi(L_{3.1})$ and $\Phi(L_{3.1})$ are
distributed onto the same processor, communication-free execution is
obtained.
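
The non-duplicate data allocation for loop L3.1 can be sketched as follows.
This is an illustration under assumptions rather than Chen and Sheu’s
implementation: the bound `N = 4`, the use of the diagonal offset
`i1 - i2` as a processor label, and all names are chosen here for
convenience. Each diagonal of the iteration space (one coset of
$span(\{[1,1]^t\})$) becomes one block, the data elements touched by a block
are collected, and the check confirms that no element of $A$, $B$, or $C$ is
needed by two different blocks.

```python
from collections import defaultdict

N = 4  # assumed loop bound

def accesses(i1, i2):
    """All (array, index) pairs touched by iteration (i1, i2) of loop L3.1."""
    return [
        ("A", (i1 - i2, -i1 + i2)),   # written in S1, read in S2
        ("B", (i1, i2)),              # read in S1, written in S2
        ("C", (i1, i2)),              # read in S1
        ("C", (i1 - 1, i2 - 1)),      # read in S2
    ]

iteration_part = defaultdict(list)  # block label -> iterations
data_part = defaultdict(set)        # block label -> data elements

for i1 in range(1, N + 1):
    for i2 in range(1, N + 1):
        block = i1 - i2  # iterations differing by c*[1,1]^t share a block
        iteration_part[block].append((i1, i2))
        data_part[block].update(accesses(i1, i2))

# Non-duplicate data: every element must belong to exactly one block.
owner = {}
conflict_free = True
for block, elements in data_part.items():
    for element in elements:
        if owner.setdefault(element, block) != block:
            conflict_free = False

print(len(iteration_part))  # 7 blocks, i.e. 2*N - 1 diagonals
print(conflict_free)        # True
```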

3.1.2 Duplicate Data Strategy. The requirement that each data element be
allocated on only one processor reduces the likelihood of communication-free
execution. If this constraint is removed, communication-free partitionings
can be found far more often. Since interprocessor communication is very
time-consuming, it is worth replicating data in order to obtain a higher
degree of parallelism.

Chen and Sheu’s method is the only one that takes data replication into
consideration. Not all data elements can be replicated, however. The data
elements that incur only output, input, and anti-dependences can be
replicated, because only true dependences result in data movement. Output,
input, and anti-dependences affect only the execution order and cause no
data movement. Hence, the iteration-dependent space needs to consider only
the true dependence relations.

Chen and Sheu’s method classifies arrays as fully duplicable or partially
duplicable. An array is fully duplicable if it involves no true dependence
relations; otherwise, the array is partially duplicable. For a fully
duplicable array, all iterations use the old values of its data elements,
not newly generated values; therefore, the array can be fully duplicated
onto all processors without affecting the correctness of execution. For a
partially duplicable array, only the data elements that involve no true
dependence relations can be replicated.

Example 3.2. Consider the following loop.

do i1 = 1, N
  do i2 = 1, N
    A(i1, i2) = A(i1 + 1, i2) + A(i1, i2 + 1)    (L3.2)
  enddo
enddo
Fig. 3.1. Non-duplicate data communication-free iteration and data
allocations for loop L3.1 when $I_{init} = [1,1]^t$. (a) Iteration space
partitioning $\Psi(L_{3.1})$ on $IS(L_{3.1})$, where $I_{init} = [1,1]^t$.
(b) Data space partitioning $\Phi(L_{3.1})$ on $DS(A)$. (c) Data space
partitioning $\Phi(L_{3.1})$ on $DS(B)$. (d) Data space partitioning
$\Phi(L_{3.1})$ on $DS(C)$.



Loop L3.2 is a perfectly nested loop with uniformly generated references.
Suppose that the non-duplicate data strategy is adopted. From Section 3.1.1,
the iteration-dependent space of loop L3.2 is
$IDS(L_{3.2}) = span(\{[-1,0]^t, [0,-1]^t, [1,-1]^t\})$, where the spaces
obtained by the self- and cross-dependent relations are $\{\mathbf{0}\}$ and
$span(\{[-1,0]^t, [0,-1]^t, [1,-1]^t\})$, respectively. Obviously,
$IDS(L_{3.2})$ spans $IS(L_{3.2})$. This means that, if sequential execution
is ruled out, loop L3.2 has no communication-free partitioning without
duplicate data.

If the duplicate data strategy is adopted instead, loop L3.2 can be fully
parallelized under the communication-free criterion. The result is derived
as follows. As explained above, the data elements that incur output, input,
and anti-dependences do not affect the correctness of execution and can be
replicated. On the other hand, the data elements that incur true dependences
cause data movement and cannot be replicated. For this example, the data
dependence vectors obtained are $[-1,0]^t$, $[0,-1]^t$, and $[1,-1]^t$. The
data dependence vectors $[-1,0]^t$ and $[0,-1]^t$ are anti-dependences and
$[1,-1]^t$ is an input dependence. Since array $A$ incurs no true
dependence, array $A$ is a fully duplicable array. Therefore, the
iteration-dependent space contains no dependence relations incurred by true
dependences. Thus, $IDS(L_{3.2}) = \{\mathbf{0}\}$. This implies that if
array $A$ is replicated onto processors appropriately, each iteration can be
executed separately and no interprocessor communication is incurred. The
distributions of iterations and data elements are as follows. Given an
initial iteration $I_{init}$, the following sets are mapped to the same
processor.

\[
\begin{aligned}
\Psi(L_{3.2}) &= \{ I_{init} \}, \\
\Phi(L_{3.2}) &= \{ D \mid D = Ref_1(I) \text{ or } D = Ref_2(I) \text{ or } D = Ref_3(I),\ \forall I \in \Psi(L_{3.2}) \}.
\end{aligned}
\]

Fig. 3.2 shows the duplicate data communication-free allocation of
iterations and data elements for loop L3.2 when $I_{init} = [3,3]^t$.
Fig. 3.2 (a) is the iteration space partitioning $\Psi(L_{3.2})$ on the
iteration space of loop L3.2, $IS(L_{3.2})$, where $I_{init} = [3,3]^t$.
Fig. 3.2 (b) is the data space partitioning $\Phi(L_{3.2})$ on the data
space of $A$. Note that the overlapped data elements are the duplicate data
elements.
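
The claim that array $A$ in loop L3.2 incurs no true dependence can be
checked by replaying the sequential execution order. The sketch below is an
illustration with assumed names and an assumed bound `N = 4`; it records
every read and write of each element in program order and classifies each
ordered pair of accesses from different iterations. The distance vectors it
prints are computed as later iteration minus earlier iteration, so their
signs may differ from the vectors quoted in the text, which follow the
convention of Eq. (2.3).

```python
from collections import defaultdict

N = 4  # assumed loop bound

# element -> list of (kind, iteration) in sequential execution order
history = defaultdict(list)
for i1 in range(1, N + 1):
    for i2 in range(1, N + 1):
        I = (i1, i2)
        history[(i1 + 1, i2)].append(("R", I))   # read A(i1 + 1, i2)
        history[(i1, i2 + 1)].append(("R", I))   # read A(i1, i2 + 1)
        history[(i1, i2)].append(("W", I))       # write A(i1, i2)

KIND = {("W", "R"): "true", ("R", "W"): "anti",
        ("W", "W"): "output", ("R", "R"): "input"}

found = {"true": set(), "anti": set(), "output": set(), "input": set()}
for accesses_to_elem in history.values():
    for a in range(len(accesses_to_elem)):
        for b in range(a + 1, len(accesses_to_elem)):
            (k1, I1), (k2, I2) = accesses_to_elem[a], accesses_to_elem[b]
            if I1 != I2:
                distance = (I2[0] - I1[0], I2[1] - I1[1])
                found[KIND[(k1, k2)]].add(distance)

print(found["true"])            # set(): no true dependence, A is fully duplicable
print(sorted(found["anti"]))    # [(0, 1), (1, 0)]
print(sorted(found["input"]))   # [(1, -1)]
```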








Fig. 3.2. Duplicate data communication-free iteration and data allocations
for loop L3.2 when $I_{init} = [3,3]^t$. (a) Iteration space partitioning
$\Psi(L_{3.2})$ on $IS(L_{3.2})$, where $I_{init} = [3,3]^t$. (b) Data space
partitioning $\Phi(L_{3.2})$ on $DS(A)$.



The duplicate data strategy does not always yield a higher degree of
parallelism. It is possible that applying the duplicate data strategy does
nothing to increase the degree of parallelism.

Example 3.3. Reconsider loop L3.1. According to the previous analyses, array
$A$ causes self- and cross-dependence relations and the dependence vectors
are $t[1,1]^t$, where $t \in Z$. Array $B$ causes no self- and
cross-dependence relations. Array $C$ causes no self-dependence relations
but does cause cross-dependence relations, and the dependence vector is
$[1,1]^t$. Array $A$ can involve true dependences and array $C$ involves
only input dependences. Obviously, array $B$ involves no dependence.
Therefore, array $A$ is a partially duplicable array and arrays $B$ and $C$
are fully duplicable arrays.

Since fully duplicable arrays involve no data movement, the dependence
vectors caused by fully duplicable arrays can be ignored by replicating the
data. However, true dependence vectors caused by partially duplicable arrays
do cause interprocessor communication and must be included in the
iteration-dependent space. Consequently, the iteration-dependent space is
$IDS(L_{3.1}) = span(\{[1,1]^t\})$, which is the same as the result obtained
in Section 3.1.1. Clearly, the degree of parallelism is not improved even
though the duplicate data strategy is adopted.

We have mentioned that only true dependence relations can cause data
movement. Output dependence relations cause no data movement, but they do
raise a data consistency problem. Output dependence relations mean that
there are multiple writes to the same data element. It is feasible to
duplicate data to eliminate the output dependence relations, but a data
consistency problem then arises. Keeping the data consistent is essential
for the correctness of the execution. Although there are multiple writes to
the same data element, only the last write is needed. Clearly, multiple
writes to the same data element may involve redundant computations.
Moreover, these redundant computations may introduce unwanted data
dependence relations and result in a loss of parallelism. Eliminating the
redundant computations removes these unwanted data dependence relations at
the same time and increases the degree of parallelism. In order to exploit
more parallelism, Chen and Sheu’s method proposed another scheme to
eliminate redundant computations. Eliminating redundant computations is a
preprocessing step performed before applying the communication-free
partitioning strategy. However, the scheme is complex and time-consuming, so
whether to apply it is left to the user. Since the elimination of redundant
computations is beyond the scope of this chapter, we omit the discussion of
the scheme. Interested readers can refer to [5].



3.2 Hyperplane Partitioning of Data Space 

This section introduces the method proposed in [26], which we call Ramanujam and Sadayappan’s method. Ramanujam and Sadayappan’s method discusses data space partitioning for two-dimensional arrays; nevertheless, the method can be easily generalized to higher dimensions. It uses a single hyperplane as the basic partitioning unit for each data space. Data on a hyperplane are assigned to a processor. Hyperplanes on data spaces are called data hyperplanes and hyperplanes on iteration spaces are called iteration hyperplanes. Hyperplanes within a space are parallel to each other. In Ramanujam and Sadayappan’s method, iteration space partitioning is not addressed explicitly. However, the method implicitly contains the concept of iteration hyperplanes.

The basic ideas of Ramanujam and Sadayappan’s method are as follows. First, the data hyperplane of each data space is assumed to be in the standard form of a hyperplane. The coefficients of the data hyperplanes are unknown and are evaluated later. Based on the array reference functions, each data hyperplane derives its corresponding iteration hyperplane; that is, all iterations referencing the data elements on the data hyperplane lie on the iteration hyperplane. Since communication-free partitioning is required, the iteration hyperplanes derived from all data hyperplanes must represent the same hyperplane. In other words, although these iteration hyperplanes are expressed differently, they all describe the same hyperplanes of the iteration space. This yields the conditions for communication-free partitioning. These conditions form a linear system in the coefficients of the data hyperplanes. Solving the linear system gives the values of the coefficients, and the data hyperplanes are then determined. Since iteration space partitioning is not considered by this method, it cannot be applied to multiple nested loops. Furthermore, the method can deal only with fully parallel loops, which contain no data dependence relations.

Example 3.4. Consider the following program model.

    do i1 = 1, N
      do i2 = 1, N
        A(i1, i2) = B(b_{1,1} i1 + b_{1,2} i2 + b_{1,0}, b_{2,1} i1 + b_{2,2} i2 + b_{2,0})    (L3.3)
      enddo
    enddo

Let $v_1$ denote A and $v_2$ denote B, and so on, unless otherwise noted. $D_i^j$ denotes the $i$-th component of the array reference of array variable $v_j$. Ramanujam and Sadayappan’s method partitions data spaces along hyperplanes. A data hyperplane on a two-dimensional data space $DS(v_j)$ is a set of data indices $\{[D_1^j, D_2^j]^t \mid \theta_1^j D_1^j + \theta_2^j D_2^j = c^j\}$ and is denoted as $\Phi^j$, where $\theta_1^j, \theta_2^j \in Q$ are hyperplane coefficients and $c^j \in Q$ is the constant term of the hyperplane. All elements in a hyperplane are handled by one processor; that is, a processor is responsible for executing all computations in an iteration hyperplane and for managing the data elements located in a data hyperplane. Note that only hyperplanes containing at least one integer-valued point are considered in this chapter.

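As defined above, the data hyperplanes for array variables $v_1$ and $v_2$ are $\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1\}$ and $\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2\}$, respectively. Since the array reference of $v_1$ is $(i_1, i_2)$, we have $D_1^1 = i_1$ and $D_2^1 = i_2$. The array reference of $v_2$ is $(D_1^2, D_2^2)$, where

$$D_1^2 = b_{1,1} i_1 + b_{1,2} i_2 + b_{1,0},$$
$$D_2^2 = b_{2,1} i_1 + b_{2,2} i_2 + b_{2,0}.$$

Substituting the loop indices for the data indices in the data hyperplanes yields the following two hyperplanes.

$$\theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1 \;\Rightarrow\; \theta_1^1 i_1 + \theta_2^1 i_2 = c^1,$$
$$\theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2 \;\Rightarrow\; (\theta_1^2 b_{1,1} + \theta_2^2 b_{2,1}) i_1 + (\theta_1^2 b_{1,2} + \theta_2^2 b_{2,2}) i_2 = c^2 - \theta_1^2 b_{1,0} - \theta_2^2 b_{2,0}.$$

From the above explanation, these two hyperplanes must represent the same hyperplane if the requirement of communication-free partitioning is to be satisfied. It implies

$$\theta_1^1 = \theta_1^2 b_{1,1} + \theta_2^2 b_{2,1},$$
$$\theta_2^1 = \theta_1^2 b_{1,2} + \theta_2^2 b_{2,2},$$
$$c^1 = c^2 - \theta_1^2 b_{1,0} - \theta_2^2 b_{2,0}.$$

We can rewrite the above formulations in matrix representation as follows.

$$\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ c^1 \end{bmatrix} = \begin{bmatrix} b_{1,1} & b_{2,1} & 0 \\ b_{1,2} & b_{2,2} & 0 \\ -b_{1,0} & -b_{2,0} & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ c^2 \end{bmatrix} \qquad (3.1)$$

By the above analyses, the two data hyperplanes of $v_1$ and $v_2$ can be represented as below.

$$\Phi^1 = \{[D_1^1, D_2^1]^t \mid (\theta_1^2 b_{1,1} + \theta_2^2 b_{2,1}) D_1^1 + (\theta_1^2 b_{1,2} + \theta_2^2 b_{2,2}) D_2^1 = c^2 - \theta_1^2 b_{1,0} - \theta_2^2 b_{2,0}\}$$
$$\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2\}$$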

The methodology for communication-free data space partitioning proposed in [26] has now been described. Let’s take a concrete program as an example to show how to apply the technique. In the preceding program model, we discussed the case in which there is just one array reference in the rhs (right-hand side) of the assignment statement. If there are multiple array references in the rhs of the assignment statement, the constraints from Eq. (3.1) must hold for each reference function to preserve the requirements of communication-free partitioning.
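As a minimal sketch (not taken from [26]), Eq. (3.1) can be read as a function that derives the hyperplane of the left-hand-side array A(i1, i2) from a chosen hyperplane of the right-hand-side array. The function name `lhs_hyperplane` and the tuple layout of `b` are assumptions made for this illustration.

```python
def lhs_hyperplane(b, theta2, c2):
    """Apply Eq. (3.1) to one right-hand-side reference.

    b      -- ((b11, b12, b10), (b21, b22, b20)): the affine subscripts of
              the reference, one row per array dimension (hypothetical layout)
    theta2 -- (theta_1^2, theta_2^2): chosen data hyperplane coefficients of v2
    c2     -- constant term of the data hyperplane of v2
    Returns ((theta_1^1, theta_2^1), c1) for the array A(i1, i2).
    """
    (b11, b12, b10), (b21, b22, b20) = b
    t1, t2 = theta2
    theta1 = (t1 * b11 + t2 * b21, t1 * b12 + t2 * b22)
    c1 = c2 - t1 * b10 - t2 * b20
    return theta1, c1


# A reference B(i1 + 2*i2 + 2, i2 + 1) with hyperplane D1 - D2 = 4 on B
print(lhs_hyperplane(((1, 2, 2), (0, 1, 1)), (1, -1), 4))
```

When a statement has several right-hand-side references, the function can be called once per reference; a communication-free hyperplane exists only if every call returns the same result.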

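Example 3.5. Consider the following loop.

    do i1 = 1, N
      do i2 = 1, N
        A(i1, i2) = B(i1 + 2 i2 + 2, i2 + 1) + B(2 i1 + i2, i1 - 1)    (L3.4)
      enddo
    enddo

Suppose the data hyperplanes of $v_1$ and $v_2$ are $\Phi^1$ and $\Phi^2$, respectively, where $\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1\}$ and $\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2\}$. Since this example contains two different array references in the rhs of the assignment statement, by Eq. (3.1) we have the following two constraints for the first and the second references.

$$\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ c^1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -2 & -1 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ c^2 \end{bmatrix}, \qquad \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ c^1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ c^2 \end{bmatrix}$$

The parameters $\theta_1^1$, $\theta_2^1$, $\theta_1^2$, $\theta_2^2$, $c^1$, and $c^2$ have to satisfy the above equations. Solving these equations gives the solution $\theta_1^1 = \theta_2^1 = \theta_1^2 = -\theta_2^2$ and $c^2 = c^1 + \theta_1^1$. Therefore, the communication-free data hyperplanes on $DS(v_1)$ and $DS(v_2)$ are respectively represented as follows:

$$\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_1^1 D_2^1 = c^1\}$$
$$\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^1 D_1^2 - \theta_1^1 D_2^2 = c^1 + \theta_1^1\}$$

Fig. 3.3 illustrates $\Phi^1$ and $\Phi^2$, the communication-free data space partitioning, for loop $L_{3.4}$ when $\theta_1^1 = 1$ and $c^1 = 3$.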
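The solution of this example can be checked by brute force. The sketch below assumes the loop body reads A(i1, i2) = B(i1 + 2*i2 + 2, i2 + 1) + B(2*i1 + i2, i1 - 1) and that the hyperplane coefficient vectors are [1, 1] for A and [1, -1] for B with the constant of B larger by one; the helper names are hypothetical. For every iteration it verifies that both elements of B that are read lie on the data hyperplane paired with the hyperplane of the element of A that is written.

```python
def plane_constant(theta, index):
    """Constant c of the hyperplane theta . index = c through a data index."""
    return theta[0] * index[0] + theta[1] * index[1]


def is_communication_free(theta_a, theta_b, shift, n):
    """Check all n x n iterations: c(B element) must equal c(A element) + shift."""
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            c_a = plane_constant(theta_a, (i1, i2))
            b_refs = ((i1 + 2 * i2 + 2, i2 + 1), (2 * i1 + i2, i1 - 1))
            for ref in b_refs:
                if plane_constant(theta_b, ref) != c_a + shift:
                    return False
    return True


print(is_communication_free((1, 1), (1, -1), 1, 8))
```

Choosing a different coefficient vector for B, for example [1, 1], makes the check fail, which matches the observation that the linear system fixes the hyperplane directions up to a common scale factor.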



Fig. 3.3. Communication-free data spaces partitioning for loop $L_{3.4}$. (a) Data hyperplane $\Phi^1$ on data space $DS(v_1)$. (b) Data hyperplane $\Phi^2$ on data space $DS(v_2)$: $D_1^2 - D_2^2 = 4$.



These examples explain Ramanujam and Sadayappan’s method in detail. Nevertheless, the above examples are not general enough. The most general program model is considered below. Based on the same ideas, the constraints for communication-free hyperplane partitioning are derived. An example is also given to illustrate the most general case.

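Example 3.6. Consider the following program model.

    do i1 = 1, N
      do i2 = 1, N
        A(a_{1,1} i1 + a_{1,2} i2 + a_{1,0}, a_{2,1} i1 + a_{2,2} i2 + a_{2,0}) =
            B(b_{1,1} i1 + b_{1,2} i2 + b_{1,0}, b_{2,1} i1 + b_{2,2} i2 + b_{2,0})    (L3.5)
      enddo
    enddo

Suppose the data hyperplanes of $v_1$ and $v_2$ are $\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1\}$ and $\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2\}$, respectively. The reference functions for each dimension of each array reference are listed as follows.

$$D_1^1 = a_{1,1} i_1 + a_{1,2} i_2 + a_{1,0},$$
$$D_2^1 = a_{2,1} i_1 + a_{2,2} i_2 + a_{2,0},$$
$$D_1^2 = b_{1,1} i_1 + b_{1,2} i_2 + b_{1,0},$$
$$D_2^2 = b_{2,1} i_1 + b_{2,2} i_2 + b_{2,0}.$$

Replacing each $D_i^j$ with its corresponding reference function, $i = 1,2$ and $j = 1,2$, the data hyperplanes $\Phi^1$ and $\Phi^2$ can be represented in terms of loop indices $i_1$ and $i_2$ as below.

$$\theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1$$
$$\Rightarrow\; \theta_1^1 (a_{1,1} i_1 + a_{1,2} i_2 + a_{1,0}) + \theta_2^1 (a_{2,1} i_1 + a_{2,2} i_2 + a_{2,0}) = c^1$$
$$\Rightarrow\; (\theta_1^1 a_{1,1} + \theta_2^1 a_{2,1}) i_1 + (\theta_1^1 a_{1,2} + \theta_2^1 a_{2,2}) i_2 = c^1 - \theta_1^1 a_{1,0} - \theta_2^1 a_{2,0},$$

$$\theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2$$
$$\Rightarrow\; \theta_1^2 (b_{1,1} i_1 + b_{1,2} i_2 + b_{1,0}) + \theta_2^2 (b_{2,1} i_1 + b_{2,2} i_2 + b_{2,0}) = c^2$$
$$\Rightarrow\; (\theta_1^2 b_{1,1} + \theta_2^2 b_{2,1}) i_1 + (\theta_1^2 b_{1,2} + \theta_2^2 b_{2,2}) i_2 = c^2 - \theta_1^2 b_{1,0} - \theta_2^2 b_{2,0}.$$

As previously stated, these two hyperplanes are the iteration hyperplanes that correspond to the two data hyperplanes. These two iteration hyperplanes must coincide if the requirement of communication-free partitioning is to be met. It implies

$$\theta_1^1 a_{1,1} + \theta_2^1 a_{2,1} = \theta_1^2 b_{1,1} + \theta_2^2 b_{2,1},$$
$$\theta_1^1 a_{1,2} + \theta_2^1 a_{2,2} = \theta_1^2 b_{1,2} + \theta_2^2 b_{2,2},$$
$$c^1 - \theta_1^1 a_{1,0} - \theta_2^1 a_{2,0} = c^2 - \theta_1^2 b_{1,0} - \theta_2^2 b_{2,0}.$$

The above conditions can be represented in matrix form as follows.

$$\begin{bmatrix} a_{1,1} & a_{2,1} & 0 \\ a_{1,2} & a_{2,2} & 0 \\ -a_{1,0} & -a_{2,0} & 1 \end{bmatrix} \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ c^1 \end{bmatrix} = \begin{bmatrix} b_{1,1} & b_{2,1} & 0 \\ b_{1,2} & b_{2,2} & 0 \\ -b_{1,0} & -b_{2,0} & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ c^2 \end{bmatrix} \qquad (3.2)$$

If there exists a nontrivial solution to the linear system obtained from Eq. (3.2), the nested loop has a communication-free hyperplane partitioning.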

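Example 3.7. Consider the following loop.

    do i1 = 1, N
      do i2 = 1, N
        A(i1 + i2, i1 - i2) = B(i1 - 2 i2, 2 i1 - i2)    (L3.6)
      enddo
    enddo

Let $\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1\}$ and $\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^2 D_1^2 + \theta_2^2 D_2^2 = c^2\}$ be the data hyperplanes of $v_1$ and $v_2$, respectively. From Eq. (3.2), we have the following system of equations:

$$\begin{bmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ c^1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 \\ -2 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ c^2 \end{bmatrix}$$

The solution to this linear system is:

$$\theta_1^2 = -\theta_1^1 + \tfrac{1}{3}\theta_2^1,$$
$$\theta_2^2 = \theta_1^1 + \tfrac{1}{3}\theta_2^1,$$
$$c^2 = c^1.$$

The linear system has a nontrivial solution; therefore, loop $L_{3.6}$ has a communication-free hyperplane partitioning. $\Phi^1$ and $\Phi^2$ can be written as:

$$\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_2^1 D_2^1 = c^1\}$$
$$\Phi^2 = \{[D_1^2, D_2^2]^t \mid (-\theta_1^1 + \tfrac{1}{3}\theta_2^1) D_1^2 + (\theta_1^1 + \tfrac{1}{3}\theta_2^1) D_2^2 = c^1\}$$

Let $\theta_1^1 = 1$ and $\theta_2^1 = 1$. The hyperplanes $\Phi^1$ and $\Phi^2$ are rewritten as follows.

$$\Phi^1 = \{[D_1^1, D_2^1]^t \mid D_1^1 + D_2^1 = c^1\}$$
$$\Phi^2 = \{[D_1^2, D_2^2]^t \mid -2 D_1^2 + 4 D_2^2 = 3c^1\}$$

Fig. 3.4 gives an illustration for $c^1 = 2$.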
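The first two rows of Eq. (3.2) form a 2 x 2 linear system once the coefficients of one data hyperplane are fixed. The following sketch solves it exactly with Cramer's rule; the function name `solve_theta2` and the argument layout are assumptions for illustration, and the loop is assumed to read A(i1 + i2, i1 - i2) = B(i1 - 2*i2, 2*i1 - i2).

```python
from fractions import Fraction


def solve_theta2(a, b, theta1):
    """Solve rows 1-2 of Eq. (3.2) for (theta_1^2, theta_2^2).

    a, b   -- ((x11, x12), (x21, x22)): loop-index coefficients of the two
              subscripts of the lhs and rhs references (hypothetical layout)
    theta1 -- chosen (theta_1^1, theta_2^1)
    """
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    rhs1 = a11 * theta1[0] + a21 * theta1[1]   # coefficient of i1
    rhs2 = a12 * theta1[0] + a22 * theta1[1]   # coefficient of i2
    det = b11 * b22 - b21 * b12
    if det == 0:
        raise ValueError("rhs reference matrix is singular; no unique solution")
    t1 = Fraction(rhs1 * b22 - b21 * rhs2, det)
    t2 = Fraction(b11 * rhs2 - b12 * rhs1, det)
    return t1, t2


theta2 = solve_theta2(((1, 1), (1, -1)), ((1, -2), (2, -1)), (1, 1))
print(theta2)  # coefficients of the data hyperplane of B before clearing fractions
```

Multiplying the resulting hyperplane equation by 3 clears the fractions, which is why the constant term of the second hyperplane appears as three times that of the first.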

Ramanujam and Sadayappan’s method deals well with a single nested loop, but it fails to process multiple nested loops because it does not consider iteration space partitioning. Likewise, it handles fully parallel loops but cannot handle loops with data dependence relations. These shortcomings are addressed by the methods proposed in [16,30].

Fig. 3.4. Communication-free data spaces partitioning for loop $L_{3.6}$. (a) Data hyperplane $\Phi^1$ on data space $DS(v_1)$. (b) Data hyperplane $\Phi^2$ on data space $DS(v_2)$.



3.3 Hyperplane Partitioning of Iteration and Data Spaces 

Huang and Sadayappan also proposed methods for the communication-free partitioning of nested loops. In this section, we describe the method proposed in [16], which we call Huang and Sadayappan’s method. Huang and Sadayappan’s method aims at finding iteration hyperplanes and data hyperplanes such that, based on the partitioning, the execution of nested loops involves no interprocessor communication. Furthermore, sufficient and necessary conditions for communication-free hyperplane partitioning are derived. They proposed single-hyperplane and multiple-hyperplane partitionings for nested loops. Single-hyperplane partitioning means that a partition element contains a single hyperplane per space and a partition element is allocated onto a processor. Multiple-hyperplane partitioning means that a partition element contains a group of hyperplanes and all elements in a partition group are handled by one processor. Multiple-hyperplane partitioning is more powerful than single-hyperplane partitioning for communication-free partitioning. Because of space limitations, we introduce only single-hyperplane partitioning. For multiple-hyperplane partitioning, refer to [16].

In Section 3.2, Ramanujam and Sadayappan’s method assumes the generic format of data hyperplanes and then determines the coefficients of data hyperplanes. Since Ramanujam and Sadayappan’s method considers only data hyperplane partitioning, its neglect of iteration hyperplanes prevents it from being applied to sequences of nested loops. Huang and Sadayappan’s method removes this limitation. However, Huang and Sadayappan’s method requires the nested loops to be perfectly nested.

An $n$-dimensional data hyperplane on $DS(v_j)$ is the set of data indices $\{[D_1^j, D_2^j, \ldots, D_n^j]^t \mid \theta_1^j D_1^j + \theta_2^j D_2^j + \cdots + \theta_n^j D_n^j = c_d^j\}$, which is denoted as $\Phi^j$, where $\theta_1^j, \ldots, \theta_n^j \in Q$ are hyperplane coefficients and $c_d^j \in Q$ is the constant term of the hyperplane. Similarly, an iteration hyperplane of a $d$-nested loop $L_i$ is a set of iterations $\{[I_1^i, I_2^i, \ldots, I_d^i]^t \mid \delta_1^i I_1^i + \delta_2^i I_2^i + \cdots + \delta_d^i I_d^i = c_s^i\}$ and is denoted as $\Psi^i$, where $\delta_1^i, \ldots, \delta_d^i \in Q$ are hyperplane coefficients and $c_s^i \in Q$ is the constant term of the hyperplane. Let $\Delta^i = [\delta_1^i, \delta_2^i, \ldots, \delta_d^i]$ be the coefficient vector of the iteration hyperplane and $\Theta^j = [\theta_1^j, \theta_2^j, \ldots, \theta_n^j]$ be the coefficient vector of the data hyperplane. An iteration hyperplane on $IS(L_i)$ and a data hyperplane on $DS(v_j)$ can be abbreviated as

$$\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\} \quad \text{and} \quad \Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\},$$

respectively, where $I^i = [I_1^i, I_2^i, \ldots, I_d^i]^t$ is an iteration on $IS(L_i)$ and $D^j = [D_1^j, D_2^j, \ldots, D_n^j]^t$ is a data index on $DS(v_j)$. If the hyperplane coefficient vector is a zero vector, the whole iteration space or data space has to be allocated onto one processor. This leads to sequential execution and is of no interest in this chapter. Hence, only non-zero iteration hyperplane coefficient vectors and data hyperplane coefficient vectors are considered in this chapter.

For any array reference of array $v_j$ in loop $L_i$, there is a sufficient and necessary condition relating the iteration hyperplane coefficient vector and the data hyperplane coefficient vector when the communication-free requirement is satisfied. The condition is stated in the following lemma.

Lemma 3.1. For a reference function $Ref(I^i) = R \cdot I^i + r = D^j$, which is from $IS(L_i)$ to $DS(v_j)$, $\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\}$ is the iteration hyperplane on $IS(L_i)$ and $\Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\}$ is the data hyperplane on $DS(v_j)$. $\Psi^i$ and $\Phi^j$ are communication-free hyperplane partitions if and only if $\Delta^i = \alpha \Theta^j \cdot R$, for some $\alpha$, $\alpha \neq 0$.

Proof. ($\Rightarrow$): Suppose that $\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\}$ and $\Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\}$ are communication-free hyperplane partitions. Let $I_1^i$ and $I_2^i$ be two distinct iterations that belong to the same iteration hyperplane, $\Psi^i$. If $D_1^j$ and $D_2^j$ are two data indices such that $Ref(I_1^i) = D_1^j$ and $Ref(I_2^i) = D_2^j$, then from the above assumptions $D_1^j$ and $D_2^j$ should belong to the same data hyperplane, $\Phi^j$. Because $I_1^i$ and $I_2^i$ belong to the same iteration hyperplane, $\Psi^i$, we have $\Delta^i \cdot I_1^i = c_s^i$ and $\Delta^i \cdot I_2^i = c_s^i$; therefore, $\Delta^i \cdot (I_1^i - I_2^i) = 0$. On the other hand, since $D_1^j$ and $D_2^j$ belong to the same data hyperplane, $\Phi^j$, it means that $\Theta^j \cdot D_1^j = c_d^j$ and $\Theta^j \cdot D_2^j = c_d^j$. Replacing $D_k^j$ by the reference function $Ref(I_k^i)$, for $k = 1,2$, we obtain $(\Theta^j \cdot R) \cdot (I_1^i - I_2^i) = 0$.

Since $I_1^i$ and $I_2^i$ are any two iterations on $\Psi^i$, $(I_1^i - I_2^i)$ is a vector on the iteration hyperplane. Furthermore, both $\Delta^i \cdot (I_1^i - I_2^i) = 0$ and $(\Theta^j \cdot R) \cdot (I_1^i - I_2^i) = 0$; hence we can conclude that $\Delta^i$ and $(\Theta^j \cdot R)$ are linearly dependent. It implies $\Delta^i = \alpha \Theta^j \cdot R$, for some $\alpha$, $\alpha \neq 0$ [15].

($\Leftarrow$): Suppose $\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\}$ and $\Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\}$ are hyperplane partitions for $IS(L_i)$ and $DS(v_j)$, respectively, and $\Delta^i = \alpha \Theta^j \cdot R$, for some $\alpha$, $\alpha \neq 0$. We claim that $\Psi^i$ and $\Phi^j$ are a communication-free partitioning.

Let $I^i$ be any iteration on iteration hyperplane $\Psi^i$. Then $\Delta^i \cdot I^i = c_s^i$. Since $\Delta^i = \alpha \Theta^j \cdot R$, replacing $\Delta^i$ by $\alpha \Theta^j \cdot R$ gives $\Theta^j \cdot Ref(I^i) = \frac{1}{\alpha} c_s^i + \Theta^j \cdot r$. Let $c_d^j = \frac{1}{\alpha} c_s^i + \Theta^j \cdot r$; then $Ref(I^i) \in \Phi^j$. We have shown that for all $I^i \in \Psi^i$, $Ref(I^i) \in \Phi^j$. It then follows that $\Psi^i$ and $\Phi^j$ are a communication-free partitioning.

Lemma 3.1 is useful for finding communication-free partitionings because it determines the hyperplane coefficient vectors. Once the data hyperplane coefficient vectors are fixed, the iteration hyperplane coefficient vectors are also determined. If the reference matrix $R$ is invertible, we can instead determine the iteration hyperplane coefficient vectors first and then evaluate the data hyperplane coefficient vectors by $\Theta^j = \frac{1}{\alpha} \Delta^i \cdot R^{-1}$, where $R^{-1}$ is the inverse of $R$. The constant terms of the hyperplanes are correlated with each other; hence, if the constant term of some hyperplane is fixed, the others can be expressed in terms of that constant term. From the proof of Lemma 3.1, if $c_s^i$ is fixed, then $c_d^j = \frac{1}{\alpha} c_s^i + \Theta^j \cdot r$. In a vector space, scaling a vector by a non-zero factor does not change the hyperplanes it is normal to. Since $\alpha$ in Lemma 3.1 is a scale factor, it can be omitted without affecting correctness. Therefore, we always let $\alpha = 1$ unless otherwise noted.
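The relation in the lemma can be sketched as a small routine that, given a data hyperplane coefficient vector, a reference matrix and offset, and an iteration hyperplane constant, returns the iteration hyperplane coefficient vector and the data hyperplane constant. The function names and argument order are assumptions for this illustration; matrices are plain nested lists.

```python
from fractions import Fraction


def row_times_matrix(theta, R):
    """Row vector theta times matrix R (R given as a list of rows)."""
    return [sum(theta[k] * R[k][j] for k in range(len(theta)))
            for j in range(len(R[0]))]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def hyperplanes_from_data_vector(theta, R, r, c_iter, alpha=1):
    """Return (delta, c_data) with delta = alpha * theta . R and
    c_data = c_iter / alpha + theta . r, rejecting a zero delta."""
    delta = [alpha * x for x in row_times_matrix(theta, R)]
    if not any(delta):
        raise ValueError("theta . R is the zero vector; choose another theta")
    c_data = Fraction(c_iter, alpha) + dot(theta, r)
    return delta, c_data


# A reference with matrix [[1, 1], [1, 1]] and zero offset, theta = [0, 1]
print(hyperplanes_from_data_vector([0, 1], [[1, 1], [1, 1]], [0, 0], 5))
```

The explicit zero-vector check mirrors the restriction that only non-zero hyperplane coefficient vectors are considered.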

Example 3.8. Consider one perfectly nested loop.

    do i1 = 1, N
      do i2 = 1, N
        A(i1 + i2, i1 + i2) = 2 * A(i1 + i2, i1 + i2) - 1    (L3.7)
      enddo
    enddo

Suppose the iteration hyperplane on $IS(L_{3.7})$ is of the form $\Psi = \{I \mid \Delta \cdot I = c_s\}$ and the data hyperplane on $DS(A)$ is $\Phi = \{D \mid \Theta \cdot D = c_d\}$. From Lemma 3.1, the data hyperplane coefficient vector $\Theta$ can be set to an arbitrary 2-dimensional vector except the zero vector and those vectors that cause the iteration hyperplane coefficient vector to be the zero vector. In this example, let $\Theta = [0, 1]$; then the iteration hyperplane coefficient vector $\Delta$ is equal to $[1, 1]$. If the constant term of the iteration hyperplane is fixed as $c_s$, the data hyperplane constant term is $c_d = c_s + \Theta \cdot r$. For this example, $c_d = c_s$. Therefore, the iteration hyperplane and data hyperplane of loop $L_{3.7}$ are $\Psi = \{I \mid [1,1] \cdot I = c_s\}$ and $\Phi = \{D \mid [0,1] \cdot D = c_s\}$, respectively. That is, $\Psi = \{[I_1, I_2]^t \mid I_1 + I_2 = c_s\}$ and $\Phi = \{[D_1, D_2]^t \mid D_2 = c_s\}$. Fig. 3.5 illustrates the communication-free hyperplane partitioning of loop $L_{3.7}$, where $c_s = 5$. Fig. 3.5 (a) and (b) are the iteration hyperplane and the data hyperplane, respectively.

Fig. 3.5. Communication-free hyperplane partitioning of loop $L_{3.7}$. (a) Iteration hyperplane partition on $IS(L_{3.7})$: $\Psi = \{[I_1, I_2]^t \mid I_1 + I_2 = 5\}$. (b) Data hyperplane partition on $DS(A)$: $\Phi = \{[D_1, D_2]^t \mid D_2 = 5\}$.

On the other hand, if the data hyperplane coefficient vector $\Theta$ is chosen as $[1, -1]$, it causes the iteration hyperplane coefficient vector $\Delta$ to be $[0, 0]$, which is a zero vector. Since all hyperplane coefficient vectors must be non-zero vectors, this result is invalid. In other words, the data hyperplane coefficient vector $\Theta$ can be any 2-dimensional vector except $[0, 0]$ and $[1, -1]$ (and its scalar multiples).

By Section 2.3, since the null space of $R$ is $NS(R) = span(\{[1, -1]^t\})$, the spaces caused by the self-dependence relations and cross-dependence relations are the same and equal $span(\{[1, -1]^t\})$. The iteration-dependent space of loop $L_{3.7}$ is $IDS(L_{3.7}) = span(\{[1, -1]^t\})$. It means that the iterations along the direction $[1, -1]^t$ should be allocated onto the same processor. This result matches that shown in Fig. 3.5.
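A brute-force way to see this is to record, for every data element, which iteration hyperplanes touch it. The sketch below assumes the loop reads and writes A(i1 + i2, i1 + i2); the helper names are hypothetical. A partition is communication-free for this loop exactly when every element is touched by a single iteration hyperplane.

```python
from collections import defaultdict


def touching_planes(n, delta, ref):
    """Map each referenced data element to the set of iteration hyperplane
    constants delta . I of the iterations that reference it."""
    planes = defaultdict(set)
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            planes[ref(i1, i2)].add(delta[0] * i1 + delta[1] * i2)
    return planes


def is_communication_free(n, delta, ref):
    """True if no data element is shared by two different hyperplanes."""
    return all(len(p) == 1 for p in touching_planes(n, delta, ref).values())


def ref_a(i1, i2):
    # assumed reference of the example loop: A(i1 + i2, i1 + i2)
    return (i1 + i2, i1 + i2)


print(is_communication_free(6, (1, 1), ref_a))   # hyperplanes i1 + i2 = c
print(is_communication_free(6, (1, 0), ref_a))   # hyperplanes i1 = c
```

Partitioning along i1 = c fails because iterations such as (1, 2) and (2, 1), which differ by a multiple of [1, -1], access the same element but fall on different hyperplanes.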

Lemma 3.1 suffices to meet the requirement of communication-free partitioning for a single array reference. If there is more than one array reference in the same loop, Lemma 3.1 is useful but not sufficient, and more conditions are needed to satisfy the communication-free criteria. Suppose there are $\gamma$ different array references of array variable $v_j$ in nested loop $L_i$, which are $Ref_k^{i,j}(I^i) = R_k^{i,j} \cdot I^i + r_k^{i,j}$, $k = 1, 2, \ldots, \gamma$. As previously defined, the iteration hyperplane is $\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\}$ and the data hyperplane is $\Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\}$. By Lemma 3.1, $\Psi^i$ and $\Phi^j$ are a communication-free partitioning if and only if $\Delta^i = \alpha \Theta^j \cdot R$, where $R$ is some reference matrix and $\alpha$ is a non-zero constant. Without loss of generality, let $\alpha = 1$. Since there are $\gamma$ different array references, Lemma 3.1 should be satisfied for every array reference. That is,

$$\Delta^i = \Theta^j \cdot R_1^{i,j} = \Theta^j \cdot R_2^{i,j} = \cdots = \Theta^j \cdot R_\gamma^{i,j}. \qquad (3.3)$$

On the other hand, the constant term of the data hyperplane is $c_d^j = c_s^i + \Theta^j \cdot r_1^{i,j} = c_s^i + \Theta^j \cdot r_2^{i,j} = \cdots = c_s^i + \Theta^j \cdot r_\gamma^{i,j}$ if the iteration hyperplane constant term is $c_s^i$. It implies that

$$\Theta^j \cdot r_1^{i,j} = \Theta^j \cdot r_2^{i,j} = \cdots = \Theta^j \cdot r_\gamma^{i,j}. \qquad (3.4)$$

Therefore, Eqs. (3.3) and (3.4) are the sufficient and necessary conditions for the communication-free hyperplane partitioning of a nested loop with several array references to an array variable. If a contradiction arises while determining the hyperplane coefficient vectors, the nested loop has no communication-free hyperplane partitioning. Otherwise, the data hyperplane coefficient vector can be evaluated accordingly, and the iteration hyperplane coefficient vector can also be determined. As a result, the iteration hyperplane and data hyperplane can be resolved.
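The two conditions for several references to one array can be checked mechanically for a candidate data hyperplane coefficient vector. In this sketch the function name `common_iteration_vector` and the `(R, r)` pair layout are assumptions; the sample references B(i1 + 2*i2 + 2, i2 + 1) and B(2*i1 + i2, i1 - 1) are used only as an illustration.

```python
def common_iteration_vector(theta, refs):
    """Check Eq. (3.3) and Eq. (3.4) for one array in one loop.

    refs is a list of (R, r) pairs, R as a list of rows and r as a list.
    Returns (delta, theta . r) if all references agree and delta is
    non-zero, otherwise None.
    """
    deltas = set()
    offsets = set()
    for R, r in refs:
        deltas.add(tuple(sum(theta[k] * R[k][j] for k in range(len(theta)))
                         for j in range(len(R[0]))))
        offsets.add(sum(t * x for t, x in zip(theta, r)))
    if len(deltas) != 1 or len(offsets) != 1:
        return None          # the references contradict each other
    delta = deltas.pop()
    if not any(delta):
        return None          # zero vector means sequential execution
    return delta, offsets.pop()


refs_b = [([[1, 2], [0, 1]], [2, 1]),    # B(i1 + 2*i2 + 2, i2 + 1)
          ([[2, 1], [1, 0]], [0, -1])]   # B(2*i1 + i2, i1 - 1)
print(common_iteration_vector((1, -1), refs_b))
print(common_iteration_vector((1, 0), refs_b))
```

A return value of `None` corresponds to the contradiction case in the text: that particular coefficient vector admits no communication-free hyperplane partitioning.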

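The same ideas can be extended to sequences of nested loops. The results obtained by the above analyses can be combined for the most general case. Suppose the iteration space for each loop $L_i$ is $IS(L_i)$. The iteration hyperplane on $IS(L_i)$ is $\Psi^i = \{I^i \mid \Delta^i \cdot I^i = c_s^i\}$. The data space for array variable $v_j$ is $DS(v_j)$ and the hyperplane on $DS(v_j)$ is $\Phi^j = \{D^j \mid \Theta^j \cdot D^j = c_d^j\}$. Let $Ref_k^{i,j}$ be the reference function of the $k$-th reference to array variable $v_j$ in loop $L_i$. Eqs. (3.3) and (3.4) can be rewritten with minor modifications to meet this representation.

$$\Delta^i = \Theta^j \cdot R_k^{i,j} \qquad (3.5)$$
$$\Theta^j \cdot r_1^{i,j} = \Theta^j \cdot r_k^{i,j} \qquad (3.6)$$

Furthermore, since $c_d^j = c_s^i + \Theta^j \cdot r_k^{i,j}$, for some array variable $v_{j_1}$ we have $c_d^{j_1} = c_s^{i_1} + \Theta^{j_1} \cdot r_1^{i_1,j_1} = c_s^{i_2} + \Theta^{j_1} \cdot r_1^{i_2,j_1}$ for two different loops $L_{i_1}$ and $L_{i_2}$. We obtain $c_s^{i_2} - c_s^{i_1} = \Theta^{j_1} \cdot (r_1^{i_1,j_1} - r_1^{i_2,j_1})$. Similarly, for some array variable $v_{j_2}$, $c_d^{j_2} = c_s^{i_1} + \Theta^{j_2} \cdot r_1^{i_1,j_2} = c_s^{i_2} + \Theta^{j_2} \cdot r_1^{i_2,j_2}$ for the two loops $L_{i_1}$ and $L_{i_2}$. We get $c_s^{i_2} - c_s^{i_1} = \Theta^{j_2} \cdot (r_1^{i_1,j_2} - r_1^{i_2,j_2})$. Combining these two equations gives the following equation.

$$\Theta^{j_1} \cdot (r_1^{i_1,j_1} - r_1^{i_2,j_1}) = \Theta^{j_2} \cdot (r_1^{i_1,j_2} - r_1^{i_2,j_2}) \qquad (3.7)$$

Thus, Eqs. (3.5), (3.6), and (3.7) are the sufficient and necessary conditions for the communication-free hyperplane partitioning of sequences of nested loops.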

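Example 3.9. Consider the following sequence of loops.

    do i1 = 1, N
      do i2 = 1, N
        A(i1 + i2 + 2, -2 i2 + 2) = B(i1 + 1, i2)    (L3.8)
      enddo
    enddo
    do i1 = 1, N
      do i2 = 1, N
        B(i1 + 2 i2, i2 + 4) = A(i1, i2 - 1)    (L3.9)
      enddo
    enddo

To simplify the representation, we number the loops and array variables according to their order of occurrence. Let $L_1$ refer to $L_{3.8}$, $L_2$ refer to $L_{3.9}$, $v_1$ refer to array variable A, and $v_2$ refer to array variable B. The iteration space of loop $L_1$ is $IS(L_1)$ and the iteration space of loop $L_2$ is $IS(L_2)$. The data spaces of $v_1$ and $v_2$ are $DS(v_1)$ and $DS(v_2)$, respectively. Suppose $\Psi^1 = \{I^1 \mid \Delta^1 \cdot I^1 = c_s^1\}$ is the iteration hyperplane on $IS(L_1)$ and $\Psi^2 = \{I^2 \mid \Delta^2 \cdot I^2 = c_s^2\}$ is the iteration hyperplane on $IS(L_2)$. The data hyperplanes on $DS(v_1)$ and $DS(v_2)$ are $\Phi^1 = \{D^1 \mid \Theta^1 \cdot D^1 = c_d^1\}$ and $\Phi^2 = \{D^2 \mid \Theta^2 \cdot D^2 = c_d^2\}$, respectively. As defined above, $Ref_k^{i,j}$ is the reference function of the $k$-th reference to array variable $v_j$ in loop $L_i$. Since each array is referenced once in each loop, the subscript $k$ is omitted below.

Since $\Delta^1$, $\Delta^2$, $\Theta^1$, and $\Theta^2$ are 2-dimensional non-zero vectors, let $\Theta^1 = [\theta_1^1, \theta_2^1]$ and $\Theta^2 = [\theta_1^2, \theta_2^2]$, where $\theta_1^1, \theta_2^1, \theta_1^2, \theta_2^2 \in Q$, $(\theta_1^1)^2 + (\theta_2^1)^2 \neq 0$, and $(\theta_1^2)^2 + (\theta_2^2)^2 \neq 0$. By Eq. (3.5), we have the following two equations: $\Delta^1 = \Theta^1 \cdot R^{1,1} = \Theta^2 \cdot R^{1,2}$ and $\Delta^2 = \Theta^1 \cdot R^{2,1} = \Theta^2 \cdot R^{2,2}$. Thus,

$$[\theta_1^1, \theta_2^1] \begin{bmatrix} 1 & 1 \\ 0 & -2 \end{bmatrix} = [\theta_1^2, \theta_2^2] \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad [\theta_1^1, \theta_2^1] \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [\theta_1^2, \theta_2^2] \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}.$$

No condition arises from Eq. (3.6) in this example since each array variable is referenced just once by each nested loop. To satisfy Eq. (3.7), the equation $\Theta^1 \cdot (r^{1,1} - r^{2,1}) = \Theta^2 \cdot (r^{1,2} - r^{2,2})$ is obtained. That is,

$$[\theta_1^1, \theta_2^1] \left( \begin{bmatrix} 2 \\ 2 \end{bmatrix} - \begin{bmatrix} 0 \\ -1 \end{bmatrix} \right) = [\theta_1^2, \theta_2^2] \left( \begin{bmatrix} 1 \\ 0 \end{bmatrix} - \begin{bmatrix} 0 \\ 4 \end{bmatrix} \right).$$

Solving the above system of equations, we obtain $\theta_2^1 = \theta_1^2 = \theta_1^1$ and $\theta_2^2 = -\theta_1^1$. Therefore, the data hyperplane coefficient vectors of $\Phi^1$ and $\Phi^2$, $\Theta^1$ and $\Theta^2$, are $[\theta_1^1, \theta_1^1]$ and $[\theta_1^1, -\theta_1^1]$, respectively. Since $\Theta^1$ and $\Theta^2$ are non-zero vectors, $\theta_1^1 \in Q - \{0\}$.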

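The coefficient vector of iteration hyperplane $\Psi^1$ is $\Delta^1 = \Theta^1 \cdot R^{1,1} = \Theta^2 \cdot R^{1,2} = [\theta_1^1, -\theta_1^1]$. Similarly, $\Delta^2 = \Theta^1 \cdot R^{2,1} = \Theta^2 \cdot R^{2,2} = [\theta_1^1, \theta_1^1]$. Let the constant term of iteration hyperplane $\Psi^1$ be fixed as $c_s^1$. The constant term of iteration hyperplane $\Psi^2$ can then be computed by $c_s^2 = c_s^1 + \Theta^1 \cdot (r^{1,1} - r^{2,1}) = c_s^1 + \Theta^2 \cdot (r^{1,2} - r^{2,2}) = c_s^1 + 5\theta_1^1$. The constant term of data hyperplane $\Phi^1$ can be evaluated by $c_d^1 = c_s^1 + \Theta^1 \cdot r^{1,1} = c_s^2 + \Theta^1 \cdot r^{2,1} = c_s^1 + 4\theta_1^1$. The constant term of data hyperplane $\Phi^2$ can be evaluated by $c_d^2 = c_s^1 + \Theta^2 \cdot r^{1,2} = c_s^2 + \Theta^2 \cdot r^{2,2} = c_s^1 + \theta_1^1$. The communication-free iteration hyperplane and data hyperplane partitions are as follows.

$$\Psi^1 = \{[I_1^1, I_2^1]^t \mid \theta_1^1 I_1^1 - \theta_1^1 I_2^1 = c_s^1\}$$
$$\Psi^2 = \{[I_1^2, I_2^2]^t \mid \theta_1^1 I_1^2 + \theta_1^1 I_2^2 = c_s^1 + 5\theta_1^1\}$$
$$\Phi^1 = \{[D_1^1, D_2^1]^t \mid \theta_1^1 D_1^1 + \theta_1^1 D_2^1 = c_s^1 + 4\theta_1^1\}$$
$$\Phi^2 = \{[D_1^2, D_2^2]^t \mid \theta_1^1 D_1^2 - \theta_1^1 D_2^2 = c_s^1 + \theta_1^1\}$$

Fig. 3.6. Communication-free hyperplane partitionings for iteration spaces of loops $L_{3.8}$ and $L_{3.9}$ and data spaces of arrays A and B. (a) Iteration hyperplane partition on $IS(L_{3.8})$: $\Psi^1 = \{[I_1^1, I_2^1]^t \mid I_1^1 - I_2^1 = 0\}$. (b) Iteration hyperplane partition on $IS(L_{3.9})$: $\Psi^2 = \{[I_1^2, I_2^2]^t \mid I_1^2 + I_2^2 = 5\}$. (c) Data hyperplane partition on $DS(A)$: $\Phi^1 = \{[D_1^1, D_2^1]^t \mid D_1^1 + D_2^1 = 4\}$. (d) Data hyperplane partition on $DS(B)$: $\Phi^2 = \{[D_1^2, D_2^2]^t \mid D_1^2 - D_2^2 = 1\}$.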



Fig. 3.6 shows the communication-free hyperplane partitions for iteration spaces $IS(L_{3.8})$ and $IS(L_{3.9})$ and data spaces $DS(A)$ and $DS(B)$, assuming $\theta_1^1 = 1$ and $c_s^1 = 0$. Fig. 3.6 (a) is the iteration space hyperplane partitioning of loop $L_{3.8}$ and Fig. 3.6 (b) is the iteration space hyperplane partitioning of loop $L_{3.9}$. Fig. 3.6 (c) and (d) illustrate the hyperplane partitionings on data spaces $DS(A)$ and $DS(B)$, respectively.
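The partition of this example can be verified exhaustively on a small problem size. The sketch below assumes the two loops read A(i1 + i2 + 2, -2*i2 + 2) = B(i1 + 1, i2) and B(i1 + 2*i2, i2 + 4) = A(i1, i2 - 1), and uses the coefficient vectors and constants of the figure (iteration vectors [1, -1] and [1, 1] with constants 0 and 5, data vectors [1, 1] and [1, -1] with constants 4 and 1). All table and function names are hypothetical.

```python
THETA = {"A": (1, 1), "B": (1, -1)}    # data hyperplane coefficient vectors
DELTA = {1: (1, -1), 2: (1, 1)}        # iteration hyperplane coefficient vectors
C_ITER = {1: 0, 2: 5}                  # constants of the base iteration hyperplanes
C_DATA = {"A": 4, "B": 1}              # constants of the base data hyperplanes

REFS = {
    1: [("A", lambda i1, i2: (i1 + i2 + 2, -2 * i2 + 2)),
        ("B", lambda i1, i2: (i1 + 1, i2))],
    2: [("B", lambda i1, i2: (i1 + 2 * i2, i2 + 4)),
        ("A", lambda i1, i2: (i1, i2 - 1))],
}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def partition_is_consistent(n, c_data=C_DATA):
    """Every iteration and every element it references must carry the same
    processor label, i.e. the same offset from its base hyperplane."""
    for loop, refs in REFS.items():
        for i1 in range(1, n + 1):
            for i2 in range(1, n + 1):
                label = dot(DELTA[loop], (i1, i2)) - C_ITER[loop]
                for array, ref in refs:
                    if dot(THETA[array], ref(i1, i2)) - c_data[array] != label:
                        return False
    return True


print(partition_is_consistent(8))
```

Changing any one constant, for example using 2 instead of 1 for array B, breaks the check, which illustrates how the constant terms of the four hyperplanes are tied together.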



4. Statement-Level Partitioning 

Traditionally, the concept of the iteration space is defined from the loop-level point of view. An iteration space is formed by iterations. Each iteration is an integer point in the iteration space and is indexed by the values of the loop indices. Every iteration consists of all statements of that index within the loop body, and executing an iteration means executing all statements of that index. However, each statement is an individual unit and can be scheduled separately. Therefore, instead of viewing each iteration as indivisible, an iteration can be separated into the statements enclosed in that iteration. The separated statements have the same index as that iteration and each is termed a statement-iteration. We use $I^s$ to denote a statement-iteration of statement $s$. Just as an iteration space is composed of iterations, the statement-iterations of a statement also form a space, and each statement has its corresponding space. We use statement-iteration space, denoted as $SIS(s)$, to refer to the space composed of the statement-iterations of $s$. A statement-iteration space has the same loop boundaries as the corresponding iteration space. Generally speaking, statement-iteration space and iteration space have similar definitions except for the viewpoint: the former is from the statement-level point of view and the latter is from the loop-level point of view. In this section we describe two statement-level communication-free partitionings: one uses affine processor mappings [22] and the other uses hyperplane partitioning [30].

4.1 Affine Processor Mapping 

The method proposed in [22] considers iteration space partitioning, in particular statement-level partitioning, to totally eliminate interprocessor communication while maximizing the degree of parallelism. We use Lim and Lam’s method to refer to the technique proposed in [22]. They use affine processor mappings to allocate statement-iterations to processors. The major consideration of Lim and Lam’s method is to find maximum communication-free parallelism. That is, the goal is to find the set of affine processor mappings for the statements in the program that exploits as much parallelism as possible under the condition that no interprocessor communication is incurred during execution. Lim and Lam’s method deals with array references that are affine functions of outer loop indices or loop-invariant variables. Their method can be applied to arbitrarily nested loops and sequences of loops.

The statement-iteration distribution scheme adopted by Lim and Lam’s method is affine processor mapping, which is of the form $Proc_i(I^i) = P^i \cdot I^i + p^i$ for statement $s_i$. It maps each statement-iteration $I^i$ in $SIS(s_i)$ to a (virtual) processor $Proc_i(I^i)$. $P^i$ is the mapping matrix of $s_i$ and $p^i$ is the mapping offset vector of $s_i$. Maximizing the degree of parallelism means maximizing the rank of $P^i$. Maximizing the rank of $P^i$ and minimizing the dimensionality of the null space of $P^i$ are conceptually the same. Therefore, minimizing the dimensionality of the null space of $P^i$ is one major goal in Lim and Lam’s method.
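As a small illustration of these notions (not Lim and Lam's algorithm itself), the sketch below evaluates an affine processor mapping and computes the rank of a mapping matrix by exact Gaussian elimination, from which the dimension of its null space follows. The function names and the sample mapping matrix are assumptions made for this example.

```python
from fractions import Fraction


def proc(P, p, iteration):
    """Affine processor mapping: P . iteration + p, returned as a tuple."""
    return tuple(sum(a * b for a, b in zip(row, iteration)) + off
                 for row, off in zip(P, p))


def rank(matrix):
    """Rank of a matrix (list of rows) by Gauss-Jordan elimination over Q."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rk = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk


P = [[1, -1]]            # hypothetical mapping matrix for a 2-deep loop nest
p = [0]                  # hypothetical offset vector
print(proc(P, p, (3, 1)), rank(P), 2 - rank(P))
```

For this hypothetical mapping, statement-iterations that differ by a multiple of [1, 1] land on the same virtual processor, so one dimension is localized and one dimension of parallelism remains.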

Similar to the iteration-dependent space defined in Section 2.3, they define another term to refer to those statement-iterations which have to be mapped to the same processor. The statement-iterations which have to be mapped to the same processor are collected in the minimal localized statement-iteration space, which is denoted as $\mathcal{L}_i$ for statement $s_i$. Therefore, the major goal changes from minimizing the dimensionality of the null space of $P^i$ to finding the minimal localized statement-iteration space. Once the minimal localized statement-iteration space of each statement is determined, the maximum degree of communication-free parallelism of each statement is given by $\dim(SIS(s_i)) - \dim(\mathcal{L}_i)$. Since the maximum degree of communication-free parallelism may differ from statement to statement, in order to preserve the communication-free parallelism available to each statement, Lim and Lam’s method chooses the maximum value among the degrees of communication-free parallelism of all statements as the dimensionality of the virtual processor array. By means of the minimal localized statement-iteration space, the affine processor mapping can be evaluated accordingly. The following examples demonstrate the concepts of Lim and Lam’s method proposed in [22].

Example 4.1. Consider the following loop.

    do i1 = 1, N
      do i2 = 1, N
        s1:  A(i1, i2) = A(i1 - 1, i2) + B(i1, i2 - 1)
        s2:  B(i1, i2) = A(i1, i2 + 1) + B(i1 + 1, i2)    (L4.1)
      enddo
    enddo



Loop $L_{4.1}$ contains two statements and there are two array variables referenced in the nested loop. Let $v_1 = A$ and $v_2 = B$. We have defined the statement-iteration space above and described how it differs from the iteration space. Fig. 4.1 gives a concrete example to illustrate the difference between iteration space and statement-iteration space. In Fig. 4.1(a), a circle means an iteration and includes two rectangles with black and gray colors. The black rectangle indicates statement $s_1$ and the gray one indicates statement $s_2$. In Fig. 4.1(b) and Fig. 4.1(c), each statement is an individual unit and the collection of statements forms two statement-iteration spaces.

Let $S$ be the set of statements and $V$ be the set of array variables referenced by $S$. Suppose $S = \{s_1, s_2, \ldots, s_\alpha\}$ and $V = \{v_1, v_2, \ldots, v_\beta\}$, where $\alpha, \beta \in Z^+$. For this example, $\alpha = 2$ and $\beta = 2$. Let the number of occurrences of variable $v_j$ in statement $s_i$ be denoted as $\gamma_{i,j}$. For this example, $\gamma_{1,1} = 2$, $\gamma_{1,2} = 1$, $\gamma_{2,1} = 1$, and $\gamma_{2,2} = 2$. Let $Ref_k^{i,j}$ denote the reference function of the $k$-th occurrence of array variable $v_j$ in statement $s_i$, where $1 \le i \le \alpha$, $1 \le j \le \beta$, and $1 \le k \le \gamma_{i,j}$. A statement-iteration on a $d$-dimensional statement-iteration space $SIS(s)$ can be written as $I^s = [i_1^s, i_2^s, \ldots, i_d^s]^t$. Let $i_k^s$ denote the $k$-th component of statement-iteration $I^s$. The reference functions for each array reference are described as follows.

$$Ref_1^{1,1}(I^1) = [i_1^1, i_2^1]^t, \quad Ref_2^{1,1}(I^1) = [i_1^1 - 1, i_2^1]^t, \quad Ref_1^{1,2}(I^1) = [i_1^1, i_2^1 - 1]^t,$$
$$Ref_1^{2,1}(I^2) = [i_1^2, i_2^2 + 1]^t, \quad Ref_1^{2,2}(I^2) = [i_1^2, i_2^2]^t, \quad Ref_2^{2,2}(I^2) = [i_1^2 + 1, i_2^2]^t.$$

Fig. 4.1. The illustrations of the differences between iteration space and statement-iteration space.

Communication-free partitioning requires the referenced data to be located
on the processor performing the execution, no matter whether the data are
read or written. Therefore, it makes no difference for communication-free
partitioning whether a data dependence is a true dependence, an
anti-dependence, an output dependence, or an input dependence. Hence, Lim
and Lam’s method defines a function, the co-reference function, that
records the data dependence relationship but does not retain the order of
reads and writes. Let $\mathcal{R}_{s,s'}$ be the co-reference function;
$\mathcal{R}_{s,s'}(I^s)$ is defined as the set of statement-iterations
$I^{s'}$ such that the data elements referenced by $I^s$ are also
referenced by $I^{s'}$, where $s, s' \in S$. Fig. 4.2 gives an abstraction
of the co-reference function.




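Fig. 4.2. The abstraction of co-reference function.

Accordingly,

$\mathcal{R}_{1,1}(I^1) = \{ I^{1'} \mid (i_1^{1'} = i_1^1) \wedge (i_2^{1'} = i_2^1) \vee
        (i_1^{1'} = i_1^1 - 1) \wedge (i_2^{1'} = i_2^1) \vee
        (i_1^{1'} = i_1^1 + 1) \wedge (i_2^{1'} = i_2^1) \}$,

$\mathcal{R}_{1,2}(I^1) = \{ I^2 \mid (i_1^2 = i_1^1) \wedge (i_2^2 = i_2^1 - 1) \vee
        (i_1^2 = i_1^1 - 1) \wedge (i_2^2 = i_2^1 - 1) \}$,

$\mathcal{R}_{2,1}(I^2) = \{ I^1 \mid (i_1^1 = i_1^2) \wedge (i_2^1 = i_2^2 + 1) \vee
        (i_1^1 = i_1^2 + 1) \wedge (i_2^1 = i_2^2 + 1) \}$,

$\mathcal{R}_{2,2}(I^2) = \{ I^{2'} \mid (i_1^{2'} = i_1^2) \wedge (i_2^{2'} = i_2^2) \vee
        (i_1^{2'} = i_1^2 + 1) \wedge (i_2^{2'} = i_2^2) \vee
        (i_1^{2'} = i_1^2 - 1) \wedge (i_2^{2'} = i_2^2) \}$.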
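The co-reference sets can also be obtained by brute force, which is a handy
cross-check on the closed forms. The following Python sketch is only an
illustration: it assumes the loop body of L4.1 as read above, a loop bound
N = 5, and it treats two statement-iterations as co-referencing when they
touch at least one common array element. The helper names `accesses` and
`co_reference` are hypothetical.

```python
from itertools import product

N = 5  # assumed loop bound; any small value works for the illustration


def accesses(stmt, i1, i2):
    """Array elements touched by one statement-iteration of loop L4.1."""
    if stmt == 1:  # s1: A(i1,i2) = A(i1-1,i2) + B(i1,i2-1)
        return {("A", i1, i2), ("A", i1 - 1, i2), ("B", i1, i2 - 1)}
    # s2: B(i1,i2) = A(i1,i2+1) + B(i1+1,i2)
    return {("B", i1, i2), ("A", i1, i2 + 1), ("B", i1 + 1, i2)}


def co_reference(s, s_prime, iteration):
    """Statement-iterations of s_prime sharing data with `iteration` of s."""
    touched = accesses(s, *iteration)
    return {
        it
        for it in product(range(1, N + 1), repeat=2)
        if accesses(s_prime, *it) & touched
    }


print(sorted(co_reference(1, 2, (3, 3))))  # iterations of s2 tied to s1 at (3, 3)
print(sorted(co_reference(1, 1, (3, 3))))  # iterations of s1 tied to s1 at (3, 3)
```

For the interior point (3, 3), the enumerated sets agree with the formulas:
the second index is fixed and only the first index varies.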

As previously described, finding the minimal null space of $P^i$ is the
same as finding the minimal localized statement-iteration space. Hence,
determining the minimal localized statement-iteration space of each
statement is the major task of Lim and Lam’s method. The minimal localized
statement-iteration space is composed of the minimum set of column vectors
satisfying the following conditions:

— Single Statement: The data dependence relationship within a
statement-iteration space may be incurred via the array references in the
same statement or between statements. This requirement maps all the
statement-iterations in a statement-iteration space that directly or
indirectly access the same data element to the same processor. In other
words, the differences between these statement-iterations should belong
to the minimal localized statement-iteration space.

— Multiple Statements: For two different statements $s_{i_1}$ and
$s_{i_2}$, suppose $I^{i_1}$ and $I'^{i_1}$ are two statement-iterations
in $SIS(s_{i_1})$, $I^{i_2} \in \mathcal{R}_{i_1,i_2}(I^{i_1})$, and
$I'^{i_2} \in \mathcal{R}_{i_1,i_2}(I'^{i_1})$. If statement-iterations
$I^{i_1}$ and $I'^{i_1}$ are mapped to the same processor, this
requirement demands that all the statement-iterations $I^{i_2}$ and
$I'^{i_2}$ also be mapped to the same processor.

Figs. 4.3 and 4.4 conceptually illustrate the conditions Single Statement
and Multiple Statements, respectively. The boldfaced lines in the two
figures mark the main requirements that these two conditions are meant to
enforce.

Fig. 4.3. The abstraction of Single Statement condition.

Fig. 4.4. The abstraction of Multiple Statements condition.



An iterative algorithm can be used to evaluate the minimal localized
statement-iteration space of each statement. First, initialize each $L_i$
using condition Single Statement. Second, iterate using condition Multiple
Statements until all the $L_i$ have converged. In what follows, we use this
iterative algorithm to evaluate the minimal localized statement-iteration
space of each statement for this example. Based on the condition Single
Statement, $L_1$ is initialized to $\{[1,0]^t\}$ and $L_2$ is initialized
to $\{[1,0]^t\}$. The algorithm then iterates according to the condition
Multiple Statements to check whether any column vector should be added to
the minimal localized statement-iteration spaces. It considers one minimal
localized statement-iteration space at a time and uses condition Multiple
Statements to add column vectors, if any, to all the other minimal
localized statement-iteration spaces. Once all the localized
statement-iteration spaces have converged, the algorithm halts. For this
example, the iterative algorithm halts when $L_1$ and $L_2$ both converge
to $\{[1,0]^t\}$. Thus, the minimal localized statement-iteration spaces
$L_1$ and $L_2$ have been evaluated, and both equal $\{[1,0]^t\}$.
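The fixed-point iteration can be sketched in a few lines of Python. This is
a simplified illustration and not the algorithm of [22]: it assumes that
every co-reference relation has the form $I^{s'} = I^s + c$ for a finite
set of constant offsets $c$ (true for loop L4.1, false for general affine
references), so Single Statement contributes the offsets themselves and
Multiple Statements contributes $L_s$ plus the pairwise offset differences.
The names `reduce_against`, `extend`, and `localized_spaces` are
hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def reduce_against(basis, v):
    """Eliminate from v its components along an echelon basis."""
    v = [Fraction(x) for x in v]
    for b in basis:
        pivot = next(k for k, x in enumerate(b) if x != 0)
        if v[pivot] != 0:
            f = v[pivot] / b[pivot]
            v = [x - f * y for x, y in zip(v, b)]
    return v


def extend(basis, v):
    """Add v to the basis if it is not already in its span."""
    r = reduce_against(basis, v)
    if any(r):
        basis.append(r)
        return True
    return False


def localized_spaces(statements, offsets):
    L = {s: [] for s in statements}
    # Single Statement: I and I + c touch the same data, so c is localized.
    for s in statements:
        for c in offsets.get((s, s), []):
            extend(L[s], c)
    # Multiple Statements: propagate until no space grows any more.
    changed = True
    while changed:
        changed = False
        for (s, t), cs in offsets.items():
            if s == t:
                continue
            for v in list(L[s]):
                changed = extend(L[t], v) or changed
            for a, b in combinations(cs, 2):
                diff = [x - y for x, y in zip(a, b)]
                changed = extend(L[t], diff) or changed
    return L


# Offsets c with I' = I + c, read off the co-reference functions of L4.1.
offsets = {
    (1, 1): [(0, 0), (-1, 0), (1, 0)],
    (1, 2): [(0, -1), (-1, -1)],
    (2, 1): [(0, 1), (1, 1)],
    (2, 2): [(0, 0), (1, 0), (-1, 0)],
}
L = localized_spaces([1, 2], offsets)
for s, basis in L.items():
    print(s, [[int(x) for x in v] for v in basis])
```

Each space ends up one-dimensional and spanned by a multiple of
$[1,0]^t$, in line with the hand computation. The loop terminates because
the rank of each space can grow at most $d$ times.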

For any two statement-iterations, if the difference between them belongs
to the space spanned by the minimal localized statement-iteration space,
the two statement-iterations have to be mapped to the same processor.
Therefore, the orthogonal complement of the minimal localized
statement-iteration space is a subspace within which no data dependence
relationship exists. That is, all statement-iterations located on the
orthogonal complement of the minimal localized statement-iteration space
are completely independent. Accordingly, the maximum degree of
communication-free parallelism of a statement is the dimensionality of the
statement-iteration space $SIS(s_i)$ minus the dimensionality of the
minimal localized statement-iteration space $L_i$. Let the maximum degree
of communication-free parallelism available for statement $s_i$ be denoted
as $\tau_i$. Then, $\tau_i = \dim(SIS(s_i)) - \dim(L_i)$. Thus,
$\tau_1 = 1$ and $\tau_2 = 1$. Lim and Lam’s method aims to exploit as much
parallelism as possible. To retain the communication-free parallelism of
each statement, the dimensionality of the (virtual) processor array has to
be set to the largest maximum degree of communication-free parallelism
among all statements. Let $\tau_p$ be the dimensionality of the (virtual)
processor array. It can be defined as
$\tau_p = \max_{i \in \{1, 2, \ldots, \alpha\}} \tau_i$. For this example,
$\tau_p = \max(\tau_1, \tau_2) = 1$.

We have decided the dimensionality of the (virtual) processor array.
Finally, we want to determine the affine processor mapping for each
statement by means of the co-reference function and the minimal localized
statement-iteration space. To map a $d$-dimensional statement-iteration
space to a $\tau_p$-dimensional processor array, the mapping matrix $P$ in
the affine processor mapping is a $\tau_p \times d$ matrix and the mapping
offset vector $p$ is a $\tau_p \times 1$ vector. For each affine processor
mapping $Proc_i(I^i) = P^i I^i + p^i$, the following two constraints should
be satisfied, where $i \in \{1, 2, \ldots, \alpha\}$.

C1  $span(L_i) \subseteq$ null space of $Proc_i$.

C2  $\forall i, i' \in \{1, 2, \ldots, \alpha\}$, $\forall I^i \in SIS(s_i)$,
    $\forall I^{i'} \in \mathcal{R}_{i,i'}(I^i)$ :
    $Proc_{i'}(I^{i'}) = Proc_i(I^i)$.

Condition C1 can be reformulated as follows. Since $span(L_i) \subseteq$
null space of $Proc_i$, it means that if $I^i, I'^i \in SIS(s_i)$ and
$(I'^i - I^i) \in span(L_i)$, then $Proc_i(I'^i) = Proc_i(I^i)$. Thus,
$P^i I'^i + p^i = P^i I^i + p^i$. It implies that
$P^i (I'^i - I^i) = \vec{0}$, where $\vec{0}$ is a $\tau_p \times 1$ zero
vector. Because $(I'^i - I^i) \in span(L_i)$, we can conclude that

C1'  $\forall x \in L_i$ : $P^i x = \vec{0}$.

A straightforward algorithm to find the affine processor mappings
according to the constraints C1 and C2 is derived in the following. First,
choose one statement whose maximum degree of communication-free
parallelism equals the dimensionality of the (virtual) processor array, say
$s_i$. Find the affine processor mapping $Proc_i$ such that the constraint
C1 is satisfied. Since $span(L_i)$ has to be included in the null space of
$Proc_i$, the range of $Proc_i$ is the orthogonal complement of the space
spanned by $L_i$. Therefore, one intuitive way to find the affine processor
mapping $Proc_i$ is to set $Proc_i(I^i) = (L_i)^\perp I^i$, where $W^\perp$
means the orthogonal complement of the space $W$. The mapping offset vector
$p^i$ is set to a zero vector. Next, based on the affine processor mapping
$Proc_i$, use constraint C2 to find the other statements’ affine processor
mappings. This process repeats until all the affine processor mappings are
found. Using the straightforward algorithm described above, we can find the
two affine processor mappings $Proc_1(I^1) = [0, 1] I^1$ and
$Proc_2(I^2) = [0, 1] I^2 + 1$. Fig. 4.5 shows the communication-free affine
processor mappings of statements $s_1$ and $s_2$ for loop L4.1.
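A direct way to convince oneself that these two mappings are
communication-free is to enumerate every statement-iteration, assign it to
its processor, and check that no array element is ever touched from two
different processors. The Python sketch below does exactly that under the
assumptions N = 5 and the loop body of L4.1 as read above; the function
names and the deliberately bad mapping `by_outer_index` are hypothetical
and serve only as a contrast.

```python
from itertools import product

N = 5  # loop bound assumed in Fig. 4.5


def accesses(stmt, i1, i2):
    """Array elements touched by one statement-iteration of loop L4.1."""
    if stmt == 1:  # s1: A(i1,i2) = A(i1-1,i2) + B(i1,i2-1)
        return {("A", i1, i2), ("A", i1 - 1, i2), ("B", i1, i2 - 1)}
    # s2: B(i1,i2) = A(i1,i2+1) + B(i1+1,i2)
    return {("B", i1, i2), ("A", i1, i2 + 1), ("B", i1 + 1, i2)}


def find_conflicts(proc):
    """Return (owner map, list of elements touched from two processors)."""
    owner, conflicts = {}, []
    for stmt in (1, 2):
        for i1, i2 in product(range(1, N + 1), repeat=2):
            p = proc(stmt, i1, i2)
            for elem in accesses(stmt, i1, i2):
                if owner.setdefault(elem, p) != p:
                    conflicts.append((elem, owner[elem], p))
    return owner, conflicts


def lim_lam(stmt, i1, i2):
    # Proc_1(I) = [0, 1] I and Proc_2(I) = [0, 1] I + 1
    return i2 if stmt == 1 else i2 + 1


def by_outer_index(stmt, i1, i2):
    # A naive mapping that ignores the localized direction [1, 0]^t.
    return i1


owner, conflicts = find_conflicts(lim_lam)
print("conflicts:", len(conflicts))
print("processors used:", sorted(set(owner.values())))
print("naive mapping conflicts:", len(find_conflicts(by_outer_index)[1]) > 0)
```

The offset of 1 in $Proc_2$ is what lines up the B elements written by
$s_2$ with the B elements read by $s_1$; dropping it makes the check fail.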

Data distribution is an important issue for parallelizing compilers on
distributed-memory multicomputers. However, Lim and Lam’s method does not
address it. The following section describes the communication-free
hyperplane partitioning for iteration and data spaces.

Fig. 4.5. Communication-free affine processor mappings
$Proc_1(I^1) = [0, 1] I^1$ and $Proc_2(I^2) = [0, 1] I^2 + 1$ of statements
$s_1$ and $s_2$ for loop L4.1, assuming N = 5. [Figure: both
statement-iteration spaces mapped onto a 1-dim processor array.]

4.2 Hyperplane Partitioning 

This section describes the method proposed by Shih, Sheu, and Huang [30],
which addresses statement-level communication-free partitioning. They
partition statement-iteration spaces and data spaces along hyperplanes. We
use Shih, Sheu, and Huang’s method to denote the method proposed in [30].
Shih, Sheu, and Huang’s method can deal with not only an imperfectly nested
loop but also sequences of imperfectly nested loops. They propose
necessary and sufficient conditions for the feasibility of
communication-free single-hyperplane partitioning for an imperfectly
nested loop and for sequences of imperfectly nested loops. The main ideas
of Shih, Sheu, and Huang’s method are similar to those proposed in Sections
3.2 and 3.3. In the following, we omit the tedious mathematical inference
and just describe the concepts of the method. For the details, see [30].

As defined in Section 4.1, $S = \{s_1, s_2, \ldots, s_\alpha\}$ is the set
of statements and $V = \{v_1, v_2, \ldots, v_\beta\}$ is the set of array
variables, where $\alpha, \beta \in Z^+$. The number of occurrences of
array variable $v_j$ in statement $s_i$ is $\gamma_{i,j}$. If $v_j$ is not
referenced in statement $s_i$, $\gamma_{i,j} = 0$. The reference function
of the $k$-th array reference of array variable $v_j$ in statement $s_i$ is
denoted as $Ref_k^{i,j}$, where $i \in \{1, 2, \ldots, \alpha\}$,
$j \in \{1, 2, \ldots, \beta\}$, and $k \in \{1, 2, \ldots, \gamma_{i,j}\}$.
Suppose $\Psi^i = \{I^i \mid \Delta^i I^i = c_s^i\}$ is the
statement-iteration hyperplane on $SIS(s_i)$ and
$\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$ is the data hyperplane on
$DS(v_j)$.

Statement-level communication-free hyperplane partitioning requires those
statement-iterations that reference the same array element to be allocated
on the same statement-iteration hyperplane. According to Lemma 21, two
statement-iterations reference the same array element if and only if the
difference of these two statement-iterations belongs to the null space of
$R_k^{i,j}$, for some $i$, $j$, and $k$. Hence, $NS(R_k^{i,j})$ should be a
subspace of the statement-iteration hyperplane. Since there may exist many
different array references, partitioning a statement-iteration space must
consider all array references appearing in the statement. Thus, the space
spanned by $NS(R_k^{i,j})$ for all array references appearing in the same
statement should be a subspace of the statement-iteration hyperplane. The
above observations are summarized in the following lemma.

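Lemma 41 (Statement-Iteration Hyperplane Coefficient Check).
For any communication-free statement-iteration hyperplane
$\Psi^i = \{I^i \mid \Delta^i I^i = c_s^i\}$, the following two conditions
must hold:

(1) $span(\bigcup_{j=1}^{\beta} \bigcup_{k=1}^{\gamma_{i,j}} NS(R_k^{i,j})) \subseteq \Psi^i$,

(2) $(\Delta^i)^t \in span(\bigcup_{j=1}^{\beta} \bigcup_{k=1}^{\gamma_{i,j}} NS(R_k^{i,j}))^\perp$,

where $S^\perp$ denotes the orthogonal complement space of $S$.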

On the other hand, the dimension of a statement-iteration hyperplane is
one less than the dimension of the statement-iteration space. If there
exists a statement $s_i$, for some $i$, such that the dimension of the
spanning space of $NS(R_k^{i,j})$, for all $j$ and $k$, is equal to the
dimension of $SIS(s_i)$, then the spanning space cannot be a subspace of
the statement-iteration hyperplane. Therefore, there exists no nontrivial
communication-free hyperplane partitioning. Thus, we obtain the following
lemma.

Lemma 42 (Statement-Iteration Space Dimension Check). If
$\exists s_i \in S$ such that

$\dim(span(\bigcup_{j=1}^{\beta} \bigcup_{k=1}^{\gamma_{i,j}} NS(R_k^{i,j}))) = \dim(SIS(s_i))$,

then there exists no nontrivial communication-free hyperplane partitioning.

In addition to the above observations, Shih, Sheu, and Huang’s method
establishes further properties that are useful for finding
communication-free hyperplane partitionings. Lemma 31 demonstrates that
the iteration hyperplane and data hyperplane are communication-free
partitionings if and only if the iteration hyperplane coefficient vector is
parallel to the vector obtained by multiplying the data hyperplane
coefficient vector by the reference matrix. Although Lemma 31 is stated for
iteration space, it also holds for statement-iteration space. Since the
statement-iteration hyperplane coefficient vector is a non-zero vector, the
product of the data hyperplane coefficient vector and the reference matrix
cannot be a zero vector. From this condition, we can derive the feasible
range of a data hyperplane coefficient vector. Therefore, we obtain the
following lemma.

Lemma 43 (Data Hyperplane Coefficient Check). For any communication-free
data hyperplane $\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$, the following
condition must hold:

$(\Theta^j)^t \in \overline{\bigcup_{i=1}^{\alpha} \bigcup_{k=1}^{\gamma_{i,j}} NS((R_k^{i,j})^t)}$,

where $\overline{S}$ denotes the complement set of $S$.

Lemmas 41 and 43 provide the statement-iteration hyperplane coefficient
vector check and the data hyperplane coefficient vector check,
respectively. Suppose the data hyperplane on data space $DS(v_j)$ is
$\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$. Since each data element is
accessed by some statement-iteration via some reference function, that is,
$D^j$ can be represented as $Ref_k^{i,j}(I^i) = R_k^{i,j} I^i + r_k^{i,j}$,
thus,

$\Theta^j D^j = c_d^j$
$\Leftrightarrow \Theta^j (R_k^{i,j} I^i + r_k^{i,j}) = c_d^j$
$\Leftrightarrow (\Theta^j R_k^{i,j}) I^i = c_d^j - \Theta^j r_k^{i,j}$.

Let

$\Delta^i = \Theta^j R_k^{i,j}$,                      (4.1)

$c_s^i = c_d^j - \Theta^j r_k^{i,j}$.                 (4.2)

As a result, those statement-iterations that reference the data elements
lying on the data hyperplane $\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$
will be located on the statement-iteration hyperplane
$\Psi^i = \{I^i \mid (\Theta^j R_k^{i,j}) I^i = c_d^j - \Theta^j r_k^{i,j}\}$.

Since there are three parameters in the above formulas,
$i \in \{1, 2, \ldots, \alpha\}$, $j \in \{1, 2, \ldots, \beta\}$, and
$k \in \{1, 2, \ldots, \gamma_{i,j}\}$, the hyperplane coefficient vectors
and constant terms must be consistent in each space, from which we can
derive some conditions. Combining the above constraints yields the
following theorem.

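Theorem 41. Let $S = \{s_1, s_2, \ldots, s_\alpha\}$ and
$V = \{v_1, v_2, \ldots, v_\beta\}$ be the sets of statements and array
variables, respectively. $Ref_k^{i,j}$ is the reference function of the
$k$-th occurrence of array variable $v_j$ in statement $s_i$, where
$i \in \{1, 2, \ldots, \alpha\}$, $j \in \{1, 2, \ldots, \beta\}$, and
$k \in \{1, 2, \ldots, \gamma_{i,j}\}$.
$\Psi^i = \{I^i \mid \Delta^i I^i = c_s^i\}$ is the statement-iteration
hyperplane on $SIS(s_i)$, for $i = 1, 2, \ldots, \alpha$.
$\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$ is the data hyperplane on
$DS(v_j)$, for $j = 1, 2, \ldots, \beta$. $\Psi^i$ and $\Phi^j$ are
communication-free hyperplane partitions if and only if the following
conditions hold.

1. $\forall i$, $\Theta^j R_k^{i,j} = \Theta^j R_1^{i,j}$, for $j = 1, 2, \ldots, \beta$; $k = 2, 3, \ldots, \gamma_{i,j}$.

2. $\forall i$, $\Theta^j R_1^{i,j} = \Theta^1 R_1^{i,1}$, for $j = 2, 3, \ldots, \beta$.

3. $\forall i$, $\Theta^j r_k^{i,j} = \Theta^j r_1^{i,j}$, for $j = 1, 2, \ldots, \beta$; $k = 2, 3, \ldots, \gamma_{i,j}$.

4. $\Theta^j (r_1^{i,j} - r_1^{1,j}) = \Theta^1 (r_1^{i,1} - r_1^{1,1})$, for $i = 2, 3, \ldots, \alpha$; $j = 2, 3, \ldots, \beta$.

5. $\forall j$, $(\Theta^j)^t \in \overline{\bigcup_{i=1}^{\alpha} \bigcup_{k=1}^{\gamma_{i,j}} NS((R_k^{i,j})^t)}$.

6. $\forall i$, $\Delta^i = \Theta^j R_k^{i,j}$, for some $j, k$, $j \in \{1, 2, \ldots, \beta\}$; $k \in \{1, 2, \ldots, \gamma_{i,j}\}$.

7. $\forall i$, $(\Delta^i)^t \in span(\bigcup_{j=1}^{\beta} \bigcup_{k=1}^{\gamma_{i,j}} NS(R_k^{i,j}))^\perp$.

8. $\forall j$, $j = 2, 3, \ldots, \beta$, $c_d^j = c_d^1 - \Theta^1 r_1^{i,1} + \Theta^j r_1^{i,j}$, for some $i$, $i \in \{1, 2, \ldots, \alpha\}$.

9. $\forall i$, $c_s^i = c_d^j - \Theta^j r_k^{i,j}$, for some $j, k$, $j \in \{1, 2, \ldots, \beta\}$; $k \in \{1, 2, \ldots, \gamma_{i,j}\}$.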

Theorem 41 can be used to determine whether the nested loop(s) is/are
communication-free. It can also be used as a systematic procedure for
finding a communication-free hyperplane partitioning. Conditions 1 to 4 in
Theorem 41 are used for finding the data hyperplane coefficient vectors.
Condition 5 checks whether the data hyperplane coefficient vectors found in
the preceding steps are within the legal range. Following the determination
of the data hyperplane coefficient vectors, the statement-iteration
hyperplane coefficient vectors can be obtained by using Condition 6.
Similarly, Condition 7 checks whether the statement-iteration hyperplane
coefficient vectors are within the legal range. The data hyperplane
constant terms and statement-iteration hyperplane constant terms can be
obtained by using Conditions 8 and 9, respectively. If any one of the
conditions is violated, the whole procedure stops and reports that the
nested loop has no communication-free hyperplane partitioning.

From Conditions 1 and 3, to satisfy the constraint that $\Theta^j$ is a
non-zero row vector, we have the following condition.

$Rank(R_1^{i,j} - R_2^{i,j}, \ldots, R_1^{i,j} - R_{\gamma_{i,j}}^{i,j},
      r_1^{i,j} - r_2^{i,j}, \ldots, r_1^{i,j} - r_{\gamma_{i,j}}^{i,j}) < \dim(DS(v_j))$,    (4.3)

for $i = 1, 2, \ldots, \alpha$ and $j = 1, 2, \ldots, \beta$. Note that this
condition can also be found in [16] for loop-level hyperplane partitioning.
We conclude the above by the following lemma.

Lemma 44 (Data Space Dimension Check). Suppose
$S = \{s_1, s_2, \ldots, s_\alpha\}$ and $V = \{v_1, v_2, \ldots, v_\beta\}$
are the sets of statements and array variables, respectively. $R_k^{i,j}$
and $r_k^{i,j}$ are the reference matrix and the reference vector,
respectively, where $i \in \{1, 2, \ldots, \alpha\}$,
$j \in \{1, 2, \ldots, \beta\}$, and $k \in \{1, 2, \ldots, \gamma_{i,j}\}$.
If a communication-free hyperplane partitioning exists, then Eq. (4.3) must
hold.

Lemmas 42 and 44 give conditions that are sufficient, but not necessary,
for ruling out a communication-free hyperplane partitioning. Lemma 42 is
the statement-iteration space dimension test and Lemma 44 is the data space
dimension test. When neither test applies, determining the existence of a
communication-free hyperplane partitioning requires checking the conditions
in Theorem 41. The following example shows how to find communication-free
hyperplanes of statement-iteration spaces and data spaces.



Example 42. Consider the following sequence of imperfectly nested loops.

do i1 = 1, N
 do i2 = 1, N
 s1: A[i1 + i2, 1] = B[i1 + i2 + 1, i1 + i2 + 2] +
                     C[i1 + 1, -2i1 + 2i2, 2i1 - i2 + 1]
  do i3 = 1, N
  s2: B[i1 + i3 + 1, i2 + i3 + 1] = A[2i1 + 2i3, i2 + i3] +
                     C[i1 + i2 + 1, -i2 + i3 + 1, i1 - i2 + 1]
enddo enddo enddo                                        (L4.2)
do i1 = 1, N
 do i2 = 1, N
  do i3 = 1, N
  s3: C[i1 + 1, i2, i2 + i3] = A[2i1 + 3i2 + i3, i1 + i2 + 2] +
                     B[i1 + i2, i1 - i3 + 1]
  enddo
 s4: A[i1, i2 + 3] = B[i1 - i2, i1 - i2 + 2] + C[i1 + i2, -i2, -i2]
enddo enddo

The set of statements is $S = \{s_1, s_2, s_3, s_4\}$. The set of array
variables is $V = \{v_1, v_2, v_3\}$, where $v_1$, $v_2$, and $v_3$
represent A, B, and C, respectively. The values of $\gamma_{1,1}$,
$\gamma_{1,2}$, $\gamma_{1,3}$, $\gamma_{2,1}$, $\gamma_{2,2}$,
$\gamma_{2,3}$, $\gamma_{3,1}$, $\gamma_{3,2}$, $\gamma_{3,3}$,
$\gamma_{4,1}$, $\gamma_{4,2}$, and $\gamma_{4,3}$ are all 1. We first use
Lemmas 42 and 44 to test whether L4.2 can be shown to have no
communication-free hyperplane partitioning. Since
$\dim(span(\bigcup_{j=1}^{3} NS(R_1^{i,j}))) = 1$, which is smaller than
$\dim(SIS(s_i))$, for $i = 1, \ldots, 4$, Lemma 42 cannot establish that
L4.2 has no communication-free hyperplane partitioning. Lemma 44 does not
apply here because all the values of $\gamma_{i,j}$ are 1, for
$i = 1, \ldots, 4$; $j = 1, \ldots, 3$. Further examination is necessary,
because Lemmas 42 and 44 cannot prove that L4.2 has no communication-free
hyperplane partitioning. From Theorem 41, if a communication-free
hyperplane partitioning exists, the conditions listed in Theorem 41 should
be satisfied; otherwise, L4.2 has no communication-free hyperplane
partitioning.
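The dimension test of Lemma 42 is mechanical, so it can be checked with a
small amount of exact linear algebra. The sketch below is an illustration
under stated assumptions: the reference matrices are those read off loop
L4.2 (one per array reference, listed per statement), and `nullspace`,
`rank`, and `span_dim` are hypothetical helper names, implemented here with
plain Gauss-Jordan elimination over rationals.

```python
from fractions import Fraction


def nullspace(rows):
    """Basis of {x : M x = 0}, computed by Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    ncols = len(m[0])
    pivots, r = [], 0
    for c in range(ncols):
        p = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        m[r] = [x / m[r][c] for x in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                f = m[k][c]
                m[k] = [a - f * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        v = [Fraction(0)] * ncols
        v[free] = Fraction(1)
        for row, pc in zip(m, pivots):
            v[pc] = -row[free]
        basis.append(v)
    return basis


def rank(rows):
    return len(rows[0]) - len(nullspace(rows)) if rows else 0


# Reference matrices R of loop L4.2, one per array reference (A, B, C order
# is irrelevant for the test).
REFS = {
    "s1": [[[1, 1], [0, 0]], [[1, 1], [1, 1]], [[1, 0], [-2, 2], [2, -1]]],
    "s2": [[[1, 0, 1], [0, 1, 1]], [[2, 0, 2], [0, 1, 1]],
           [[1, 1, 0], [0, -1, 1], [1, -1, 0]]],
    "s3": [[[1, 0, 0], [0, 1, 0], [0, 1, 1]], [[2, 3, 1], [1, 1, 0]],
           [[1, 1, 0], [1, 0, -1]]],
    "s4": [[[1, 0], [0, 1]], [[1, -1], [1, -1]], [[1, 1], [0, -1], [0, -1]]],
}


def span_dim(refs):
    """Dimension of the span of all null spaces NS(R) of one statement."""
    return rank([v for R in refs for v in nullspace(R)])


for name, refs in REFS.items():
    depth = len(refs[0][0])  # dimension of SIS(s_i)
    print(name, span_dim(refs), depth, "ruled out" if span_dim(refs) == depth else "inconclusive")
```

For every statement the span has dimension 1, which is below the nesting
depth, so the test is inconclusive and Theorem 41 has to be consulted.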

Let $\Psi^i = \{I^i \mid \Delta^i I^i = c_s^i\}$ be the statement-iteration
hyperplane on $SIS(s_i)$ and $\Phi^j = \{D^j \mid \Theta^j D^j = c_d^j\}$
be the data hyperplane on $DS(v_j)$. Since the dimensions of the data
spaces $DS(v_1)$, $DS(v_2)$, and $DS(v_3)$ are 2, 2, and 3, respectively,
without loss of generality, the data hyperplane coefficient vectors can be
respectively assumed to be $\Theta^1 = [\theta_1^1, \theta_2^1]$,
$\Theta^2 = [\theta_1^2, \theta_2^2]$, and
$\Theta^3 = [\theta_1^3, \theta_2^3, \theta_3^3]$. In what follows, the
requirements for the feasibility of communication-free hyperplane
partitioning are examined one by one.

There is no need to examine Conditions 1 and 3 because all the values of
$\gamma_{i,j}$ are 1. Solving the linear system obtained from Conditions 2
and 4 gives the general solution
$(\theta_1^1, \theta_2^1, \theta_1^2, \theta_2^2, \theta_1^3, \theta_2^3, \theta_3^3) = (t, -t, 2t, -t, t, t, t)$,
$t \in Q - \{0\}$. Therefore, $\Theta^1 = [t, -t]$, $\Theta^2 = [2t, -t]$,
and $\Theta^3 = [t, t, t]$. Verifying Condition 5 shows that all the data
hyperplane coefficient vectors are within the legal range. Therefore, the
statement-iteration hyperplane coefficient vectors can be evaluated by
Condition 6. Thus, $\Delta^1 = [t, t]$, $\Delta^2 = [2t, -t, t]$,
$\Delta^3 = [t, 2t, t]$, and $\Delta^4 = [t, -t]$. The legality of these
statement-iteration hyperplane coefficient vectors is then checked by using
Condition 7. Checking Condition 7 shows that all the statement-iteration
and data hyperplane coefficient vectors are legal. These results reveal
that the nested loops have communication-free hyperplane partitionings.
Finally, the data and statement-iteration hyperplane constant terms are
decided by using Conditions 8 and 9, respectively. Let one data hyperplane
constant term be fixed, say $c_d^1$. The other hyperplane constant terms
can be determined accordingly. Therefore, $c_d^2 = c_d^1 + t$,
$c_d^3 = c_d^1 + 3t$, $c_s^1 = c_d^1 + t$, $c_s^2 = c_d^1$,
$c_s^3 = c_d^1 + 2t$, and $c_s^4 = c_d^1 + 3t$. Therefore, the
communication-free hyperplane partitioning for loop L4.2 is
$G = \Psi^1 \cup \Psi^2 \cup \Psi^3 \cup \Psi^4 \cup \Phi^1 \cup \Phi^2 \cup \Phi^3$,
where

$\Psi^1 = \{I^1 \mid [t, t] I^1 = c_d^1 + t\}$,
$\Psi^2 = \{I^2 \mid [2t, -t, t] I^2 = c_d^1\}$,
$\Psi^3 = \{I^3 \mid [t, 2t, t] I^3 = c_d^1 + 2t\}$,
$\Psi^4 = \{I^4 \mid [t, -t] I^4 = c_d^1 + 3t\}$,
$\Phi^1 = \{D^1 \mid [t, -t] D^1 = c_d^1\}$,
$\Phi^2 = \{D^2 \mid [2t, -t] D^2 = c_d^1 + t\}$,
$\Phi^3 = \{D^3 \mid [t, t, t] D^3 = c_d^1 + 3t\}$.

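Conditions 6 and 9 amount to evaluating Eqs. (4.1) and (4.2) for every
array reference and checking that all references of a statement agree. The
Python sketch below works this out for loop L4.2 with the particular
choices t = 1 and $c_d^1 = 0$, both of which are assumptions made only for
the illustration; the reference matrices and vectors are those read off the
loop, and the helper names are hypothetical.

```python
T = 1  # free parameter t of the general solution (assumed value)
THETA = {"A": [T, -T], "B": [2 * T, -T], "C": [T, T, T]}
C_D = {"A": 0, "B": T, "C": 3 * T}  # data constants, with c_d^1 = 0 assumed

# (array, reference matrix R, reference vector r) for each array reference
REFS = {
    "s1": [("A", [[1, 1], [0, 0]], [0, 1]),
           ("B", [[1, 1], [1, 1]], [1, 2]),
           ("C", [[1, 0], [-2, 2], [2, -1]], [1, 0, 1])],
    "s2": [("B", [[1, 0, 1], [0, 1, 1]], [1, 1]),
           ("A", [[2, 0, 2], [0, 1, 1]], [0, 0]),
           ("C", [[1, 1, 0], [0, -1, 1], [1, -1, 0]], [1, 1, 1])],
    "s3": [("C", [[1, 0, 0], [0, 1, 0], [0, 1, 1]], [1, 0, 0]),
           ("A", [[2, 3, 1], [1, 1, 0]], [0, 2]),
           ("B", [[1, 1, 0], [1, 0, -1]], [0, 1])],
    "s4": [("A", [[1, 0], [0, 1]], [0, 3]),
           ("B", [[1, -1], [1, -1]], [0, 2]),
           ("C", [[1, 1], [0, -1], [0, -1]], [0, 0, 0])],
}


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][c] for k in range(len(v))) for c in range(len(m[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def statement_hyperplane(refs):
    """Apply Eqs. (4.1) and (4.2) to every reference; all must agree."""
    candidates = {
        (tuple(vec_mat(THETA[a], R)), C_D[a] - dot(THETA[a], r))
        for a, R, r in refs
    }
    if len(candidates) != 1:
        raise ValueError("references disagree: not communication-free")
    return candidates.pop()


for name, refs in REFS.items():
    delta, c_s = statement_hyperplane(refs)
    print(name, list(delta), c_s)
```

Every statement yields a single coefficient vector and a single constant
term, which is precisely the consistency that the theorem demands.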
Fig. 4.6 illustrates the communication-free hyperplane partitionings for
loop L4.2, where $t = 1$ and $c_d^1 = 0$. The corresponding parallelized
program is as follows.

doall c = -7, 18
 do i1 = max(min(c - 4, ceil((c - 4)/2)), 1), min(max(c, floor((c + 4)/2)), 5)
  if ( max(c - 4, 1) <= i1 <= min(c, 5) )
   i2 = c - i1 + 1
   A[i1 + i2, 1] = B[i1 + i2 + 1, i1 + i2 + 2] +
                   C[i1 + 1, -2i1 + 2i2, 2i1 - i2 + 1]
  endif
  do i2 = max(2i1 - c + 1, 1), min(2i1 - c + 5, 5)
   i3 = c - 2i1 + i2
   B[i1 + i3 + 1, i2 + i3 + 1] = A[2i1 + 2i3, i2 + i3] +
                   C[i1 + i2 + 1, -i2 + i3 + 1, i1 - i2 + 1]
  enddo
 enddo
 do i1 = max(c - 13, 1), min(c - 1, 5)
  do i2 = max(ceil((c - i1 - 3)/2), 1), min(floor((c - i1 + 1)/2), 5)
   i3 = c - i1 - 2i2 + 2
   C[i1 + 1, i2, i2 + i3] = A[2i1 + 3i2 + i3, i1 + i2 + 2] +
                   B[i1 + i2, i1 - i3 + 1]
 enddo enddo
 do i1 = max(c + 4, 1), min(c + 8, 5)
  i2 = i1 - c - 3
  A[i1, i2 + 3] = B[i1 - i2, i1 - i2 + 2] + C[i1 + i2, -i2, -i2]
 enddo
enddoall
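Loop bounds of this kind are easy to get wrong, so it is worth checking
them by enumeration. The Python sketch below replays the doall program for
N = 5 and verifies two things: every statement-iteration of the original
loops is visited exactly once, and no array element is touched under two
different values of c. It is an illustration under assumptions: the rounded
bounds are re-derived from the hyperplane equations with t = 1 and
$c_d^1 = 0$, and all function names are hypothetical.

```python
from itertools import product

N = 5  # loop bound used in Fig. 4.6


def ceil_div(a, b):
    return -(-a // b)


def parallel_iterations():
    """Yield (c, statement, iteration) as visited by the doall program."""
    for c in range(-7, 19):
        lo = max(min(c - 4, ceil_div(c - 4, 2)), 1)
        hi = min(max(c, (c + 4) // 2), N)
        for i1 in range(lo, hi + 1):
            if max(c - 4, 1) <= i1 <= min(c, N):
                yield c, "s1", (i1, c - i1 + 1)
            for i2 in range(max(2 * i1 - c + 1, 1), min(2 * i1 - c + 5, N) + 1):
                yield c, "s2", (i1, i2, c - 2 * i1 + i2)
        for i1 in range(max(c - 13, 1), min(c - 1, N) + 1):
            i2_lo = max(ceil_div(c - i1 - 3, 2), 1)
            i2_hi = min((c - i1 + 1) // 2, N)
            for i2 in range(i2_lo, i2_hi + 1):
                yield c, "s3", (i1, i2, c - i1 - 2 * i2 + 2)
        for i1 in range(max(c + 4, 1), min(c + 8, N) + 1):
            yield c, "s4", (i1, i1 - c - 3)


def accesses(stmt, it):
    """Array elements touched by one statement-iteration of loop L4.2."""
    if stmt == "s1":
        i1, i2 = it
        return [("A", i1 + i2, 1), ("B", i1 + i2 + 1, i1 + i2 + 2),
                ("C", i1 + 1, -2 * i1 + 2 * i2, 2 * i1 - i2 + 1)]
    if stmt == "s2":
        i1, i2, i3 = it
        return [("B", i1 + i3 + 1, i2 + i3 + 1), ("A", 2 * i1 + 2 * i3, i2 + i3),
                ("C", i1 + i2 + 1, -i2 + i3 + 1, i1 - i2 + 1)]
    if stmt == "s3":
        i1, i2, i3 = it
        return [("C", i1 + 1, i2, i2 + i3), ("A", 2 * i1 + 3 * i2 + i3, i1 + i2 + 2),
                ("B", i1 + i2, i1 - i3 + 1)]
    i1, i2 = it
    return [("A", i1, i2 + 3), ("B", i1 - i2, i1 - i2 + 2), ("C", i1 + i2, -i2, -i2)]


def check():
    visited, owner, conflicts = {}, {}, []
    for c, stmt, it in parallel_iterations():
        visited[(stmt, it)] = visited.get((stmt, it), 0) + 1
        for elem in accesses(stmt, it):
            if owner.setdefault(elem, c) != c:
                conflicts.append((elem, owner[elem], c))
    return visited, conflicts


visited, conflicts = check()
rng = range(1, N + 1)
expected = ({(s, it) for s in ("s1", "s4") for it in product(rng, repeat=2)}
            | {(s, it) for s in ("s2", "s3") for it in product(rng, repeat=3)})
print(set(visited) == expected, max(visited.values()), len(conflicts))
```

The same harness can be reused with other values of N by adjusting the
constant 5 in the bounds and the range of c accordingly.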



Fig. 4.6. Communication-free statement-iteration hyperplanes and data
hyperplanes for loop L4.2, where $t = 1$ and $c_d^1 = 0$. (a)
Statement-iteration hyperplane on $SIS(s_1)$. (b) Statement-iteration
hyperplane on $SIS(s_2)$. (c) Statement-iteration hyperplane on
$SIS(s_3)$. (d) Statement-iteration hyperplane on $SIS(s_4)$. (e) Data
hyperplane on $DS(A)$. (f) Data hyperplane on $DS(B)$. (g) Data hyperplane
on $DS(C)$.

5. Comparisons and Discussions

Recently, communication-free partitioning has received much emphasis for
parallelizing compilers. Several partitioning techniques have been proposed
in the literature. In the previous sections we have glanced over these
techniques. Chen and Sheu’s and Ramanujam and Sadayappan’s methods can deal
with a single loop. Since Ramanujam and Sadayappan’s method does not
address iteration space partitioning, it fails to handle multiple nested
loops. Huang and Sadayappan’s, Lim and Lam’s, and Shih, Sheu, and Huang’s
methods all can deal with a sequence of nested loops. Besides, Chen and
Sheu’s and Huang and Sadayappan’s methods address only perfectly nested
loop(s), whereas all the others can manage imperfectly nested loop(s).
Except for Ramanujam and Sadayappan’s method, which requires the nested
loops to be fully parallel, the methods can process nested loop(s) with or
without data dependence relations. As for the array reference function,
each method can process affine array reference functions, except that Chen
and Sheu’s method additionally requires the references to be uniformly
generated.

We classify these methods as loop-level partitioning or statement-level
partitioning. Loop-level partitioning views each iteration as a basic unit
and partitions iterations and/or data onto processors. Chen and Sheu’s,
Ramanujam and Sadayappan’s, and Huang and Sadayappan’s methods are
loop-level partitioning. Lim and Lam’s and Shih, Sheu, and Huang’s methods
partition statement-iterations and/or data onto processors and are
statement-level partitioning. The partitioning strategy used by Chen and
Sheu’s method is based on finding the iteration-dependent space. Once the
iteration-dependent space is determined, the data accessed by the
iterations on the iteration-dependent space are grouped together and then
distributed onto processors along with the corresponding
iteration-dependent space. Lim and Lam’s method partitions
statement-iteration spaces by using affine processor mappings. All the
other methods partition iteration and/or data spaces along hyperplanes.
Ramanujam and Sadayappan’s method addresses only data space partitioning,
and Lim and Lam’s method addresses only statement-iteration space
partitioning; the others propose both iteration and data space
partitionings.

It is well known that the dimensionality of a hyperplane is one less than
the dimensionality of the original vector space. Therefore, the degree of
parallelism exploited by hyperplane partitioning techniques is one. On the
other hand, Chen and Sheu’s and Lim and Lam’s methods can exploit the
maximum degree of communication-free parallelism. All methods discuss
communication-free partitioning under the assumption that each data element
is distributed onto exactly one processor, so that every other processor
that needs the data element has to access it via interprocessor
communication. However, Chen and Sheu’s method presents not only the
non-duplicate data strategy but also a duplicate data strategy, which
allows data to be appropriately duplicated onto processors in order to make
the nested loop communication-free or to exploit a higher degree of
parallelism.

For simplicity, we number each method as follows.

1. Chen and Sheu’s method. 

2. Ramanujam and Sadayappan’s method. 

3. Huang and Sadayappan’s method. 







4. Lim and Lam’s method. 

5. Shih, Sheu, and Huang’s method. 

Synthesizing the above discussions yields the following tables. Table 5.1
compares the methods according to the loop model that each method can deal
with. It compares Loop(s), Nest, Type, and Reference Function. Loop(s)
means the number of loops that the method can handle. Nest indicates
whether the nested loop must be perfectly nested or may be imperfectly
nested. Type indicates the type of the nested loop, fully parallel or
arbitrary. Reference Function denotes the type of array reference
functions. Table 5.2 compares the capabilities of each method. Level
indicates whether the method operates at the loop level or the statement
level. Partitioning Strategy is the strategy adopted by each method.
Partitioning Space shows the spaces that the method can partition, which
include computation space partitioning and data space partitioning. Table
5.3 compares the functionalities of each method. Degree of Parallelism
gives the degree of parallelism that the method can exploit on the premise
of communication-free partitioning. Duplicate Data indicates whether the
method allows data to be duplicated onto processors.



Table 5.1. Comparisons of communication-free partitioning techniques -
Loop Model.

Method  Loop(s)   Nest         Type            Reference Function
1       single    perfectly    arbitrary       affine function with uniformly
                                               generated reference
2       single    imperfectly  fully parallel  affine function
3       multiple  perfectly    arbitrary       affine function
4       multiple  imperfectly  arbitrary       affine function
5       multiple  imperfectly  arbitrary       affine function



Table 5.2. Comparisons of communication-free partitioning techniques -
Capability.

Method  Level      Partitioning Strategy      Partitioning Space
                                              Computation  Data
1       loop       iteration-dependent space  yes          yes
2       loop       hyperplane                 no           yes
3       loop       hyperplane                 yes          yes
4       statement  affine processor mapping   yes          no
5       statement  hyperplane                 yes          yes

Table 5.3. Comparisons of communication-free partitioning techniques -
Functionality.

Method  Degree of Parallelism                    Duplicate Data
1       maximum communication-free parallelism   yes
2       1                                        no
3       1                                        no
4       maximum communication-free parallelism   no
5       1                                        no



6. Conclusions 

As the cost of data communication is much higher than that of a primitive
computation in distributed-memory multicomputers, reducing the
communication overhead as much as possible is the most promising way to
achieve high-performance computing. Communication-free partitioning is the
ideal situation in which communication overhead is eliminated entirely,
whenever that is possible. Therefore, it is of critical importance for
distributed-memory multicomputers. We have surveyed the current compilation
techniques for communication-free partitioning of nested loops in this
chapter. The characteristics of each method and the differences among them
have also been addressed.

However, there are many programs that cannot be partitioned without
communication. Developing efficient partitioning techniques that reduce
communication overhead as much as possible in such cases is the direction
for future research in this area.

Summary. Data and computation alignment is an important part of compiling
sequential programs to architectures with non-uniform memory access times. In 
this paper, we show that elementary matrix methods can be used to determine 
communication-free alignment of code and data. We also solve the problem of 
replicating data to eliminate communication. Our matrix-based approach leads to 
algorithms which work well for a variety of applications, and which are simpler and 
faster than other matrix-based algorithms in the literature. 



1. Introduction 

A key problem in generating code for non-uniform memory access (NUMA) 
parallel machines is data and computation placement — that is, determining 
what work each processor must do, and what data must reside in each local 
memory. The goal of placement is to exploit parallelism by spreading the 
work across the processors, and to exploit locality by spreading data so that 
memory accesses are local whenever possible. The problem of determining a 
good placement for a program is usually solved in two phases called align- 
ment and distribution. The alignment phase maps data and computations to 
a set of virtual processors organized as a Cartesian grid of some dimension 
(a template in HPF Fortran terminology). The distribution phase folds the 
virtual processors into the physical processors. The advantage of separating 
alignment from distribution is that we can address the collocation problem 
(determining which iterations and data should be mapped to the same pro- 
cessor) without worrying about the load balancing problem. 

Our focus in this paper is alignment. A complete solution to this problem 
can be obtained in three steps. 

1. Determine the constraints on data and computation placement. 

2. Determine which constraints should be left unsatisfied. 

3. Solve the remaining system of constraints to determine data and compu- 
tation placement. 

° An earlier version of this paper was presented in the 7th Annual Workshop on 
Languages and Compilers for Parallel Computers (LCPC), Ithaca, 1994. 

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 385-411, 2001.

In the first step, data references in the program are examined to
determine a system of equations in which the unknowns are functions representing
data and computation placements. Any solution to this system of equations 
determines a so-called communication-free alignment [6] — that is, a map 
of data elements and computations to virtual processors such that all data 
required by a processor to execute the iterations mapped to it are in its local 
memory. Very often, the only communication-free alignment for a program 
is the trivial one in which every iteration and datum is mapped to a single 
processor. Intuitively, each equation in the system is a constraint on data and 
computation placement, and it is possible to overconstrain the system so that 
the trivial solution is the only solution. If so, the second step of alignment 
determines which constraints must be left unsatisfied to retain parallelism in 
execution. The cost of leaving a constraint unsatisfied is that it introduces 
communication; therefore, the constraints left unsatisfied should be those that 
introduce as little communication as possible. In the last step, the remaining 
constraints are solved to determine data and computation placement. 

The following loop illustrates these points. It computes the product Y of 
a sub-matrix A(11 : N + 10, 11 : N + 10) and a vector X: 

DO i=1,N 
  DO j=1,N 
    Y(i) = Y(i) + A(i+10,j+10)*X(j) 

For simplicity, assume that the virtual processors are organized as a 
one-dimensional grid T. Let us assume that computations are mapped by iteration 
number; that is, a processor does all or none of the work in executing an 
iteration of the loop. To avoid communication, the processor that executes 
iteration (i,j) must have A(i+10, j+10), Y(i) and X(j) in its local memory. 
These constraints can be expressed formally by defining the following 
functions that map loop iterations and array elements to virtual processors: 

$$
\begin{aligned}
C &: (i,j) \to T && \text{processor that performs iteration } (i,j) \\
D_A &: (i,j) \to T && \text{processor that owns } A(i,j) \\
D_Y &: i \to T && \text{processor that owns } Y(i) \\
D_X &: j \to T && \text{processor that owns } X(j)
\end{aligned}
$$

The constraints on these functions are the following. 

$$
\forall i,j \ \text{s.t.}\ 1 \le i,j \le N: \quad
\begin{cases}
C(i,j) = D_A(i+10,\, j+10) \\
C(i,j) = D_Y(i) \\
C(i,j) = D_X(j)
\end{cases}
$$

If we enforce all of the constraints, the only solution is the trivial solution 
in which all data and computations are mapped to a single processor. In this 
case, we say that our system is overconstrained. If we drop the constraint on 
X, we have a non-trivial solution to the resulting system of constraints, which 
maps iteration (i,j) to processor i, and maps array elements A(i+10, j+10), 
X(i) and Y(i) to processor i. Note that all these maps are affine functions; 
for example, the map of array A to the virtual processors can be written 
as follows: 

$$
D_A(a,b) = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} - 10 = a - 10
$$

Since there is more than one processor involved in the computation, we 
have parallel execution of the program. However, elements of X must be 
communicated at runtime. 
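
The placement found by inspection can be checked mechanically. The following is a minimal sketch, assuming a small problem size N = 8 and using hypothetical Python function names for the four maps; it confirms that the constraints for A and Y hold for every iteration, and counts the iterations for which the dropped constraint on X is violated (these are the ones that need communication).

```python
# Illustrative check of the alignment found by inspection.
# N and the function names are assumptions made for this sketch.
N = 8

def C(i, j):        # processor that performs iteration (i, j)
    return i

def D_A(a, b):      # processor that owns A(a, b)
    return a - 10

def D_Y(i):         # processor that owns Y(i)
    return i

def D_X(j):         # processor that owns X(j)
    return j

iters = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]

# Constraints that were kept: they must hold for every iteration.
a_local = all(C(i, j) == D_A(i + 10, j + 10) for i, j in iters)
y_local = all(C(i, j) == D_Y(i) for i, j in iters)

# Constraint that was dropped: count iterations whose X element is remote.
x_remote = sum(C(i, j) != D_X(j) for i, j in iters)

print(a_local, y_local, x_remote)
```

Only the iterations with i = j find their element of X locally, which is why X has to be communicated.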

In this example, the solution to the alignment equations was determined 
by inspection, but how does one solve such systems of equations in general? 
Note that the unknowns are general functions, and that each function may 
be constrained by several equations (as is the case for C in the example). To 
make the problem tractable, it is standard to restrict the maps to linear (or 
affine) functions of loop indices. This restriction is not particularly onerous in 
general - in fact, it permits more general maps of computation and data than 
are allowed in HPF. The unknowns in the equations now become matrices, 
rather than general functions, but it is still not obvious how such systems 
of matrix equations can be solved. In Section 2, we introduce our linear 
algebraic framework that reduces the problem of solving systems of alignment 
equations to the standard linear algebra problem of determining a basis for 
the null space of a matrix. One weakness of existing approaches to alignment 
is that they handle only linear functions; general affine functions, like the 
map of array A, must be dealt with in ad hoc ways. In Section 3, we show 
that our framework permits affine functions to be handled without difficulty. 

In some programs, replication of arrays is useful for exploiting parallelism. 
Suppose we wanted to parallelize all iterations of our matrix-vector 
multiplication loop. The virtual processor (i, j) would execute the iteration (i, j) and 
own the array element A(i+10, j+10). It would also require the array 
element X(j). This means that we have to replicate the array X along the i 
dimension of the virtual processor grid. In addition, element Y(i) must be 
computed by reducing (adding) values computed by the set of processors 
(i, *). In Section 4, we show that our framework permits a solution to the 
replication/reduction problem as well. 

Finally, we give a systematic procedure for dropping constraints from over- 
constrained systems. Finding an optimal solution that trades off parallelism 
for communication is very difficult. First, it is hard to model accurately the 
cost of communication and the benefit of parallelism. For example, parallel 
matrix-vector product is usually implemented either by mapping rows of the 
matrix to processors (so-called 1-D alignment) or by mapping general sub- 
matrices to processors (so-called 2-D alignment). Which mapping is better 
depends very much on the size of the matrix, and on the communication to 
computation speed ratio of the machine [9]. Second, even for simple parallel 
models and restricted cases of the alignment problem, finding the optimal 
solution is known to be NP-complete [10]. Therefore, we must fall 
back on heuristics. In Section 5, we discuss our heuristic. Not surprisingly, 
our heuristic is skewed to “do the right thing” for kernels like matrix-vector 
product which are extremely important in practice. 

How does our work relate to previous work on alignment? Our work is 
closest in spirit to that of Huang and Sadayappan who were the first to for- 
mulate the problem of communication-free alignment in terms of systems 
of equational constraints [6]. However, they did not give a general method 
for solving these equations. Also, they did not handle replication of data. 
Anderson and Lam sketched a solution method [1], but their approach is un- 
necessarily complex, requiring the determination of cycles in bipartite graphs, 
computing pseudo-inverses etc - these complications are eliminated by our 
approach. 

The equational, matrix-based approach described in this paper is not the 
only approach that has been explored. Li and Chen have used graph-theoretic 
methods to trade off communication for parallelism for a limited kind of 
alignment called axis alignment [10]. More general heuristics for a wide variety 
of cost-of-communication metrics have been studied by Chatterjee, Gilbert 
and Schreiber [2,3], Feautrier [5] and Knobe et al [7,8]. 

To summarize, the contributions of this paper are the following. 

1. We show that the problem of determining communication-free partitions 
of computation and data can be reduced to the standard linear algebra 
problem of determining a basis for the null space of a matrix, which can 
be solved using fairly standard techniques (Section 2.2). 

2. Previous approaches to alignment handle linear maps, but deal with affine 
maps in fairly ad hoc ways. We show that affine maps can be folded into 
our framework without difficulty (Section 3). 

3. We show how replication of arrays is handled by our framework (Sec- 
tion 4). 

4. We suggest simple and effective heuristic strategies for deciding when 
communication should be introduced (Section 5). 



2. Linear Alignment 

To avoid introducing too many ideas at once, we restrict attention to linear 
subscripts and linear maps in this section. First, we show that the alignment 
problem can be formulated using systems of equational constraints. Then, we 
show that the problem of solving these systems of equations can be reduced 
to the standard problem of determining a basis for the null space of a matrix, 
which can be solved using integer-preserving Gaussian elimination. 

2.1 Equational Constraints 

The equational constraints for alignment are simply a formalization of an 
intuitively reasonable statement: ‘to avoid communication, the processor that 
performs an iteration of a loop nest must own the data referenced in that 
iteration’. We discuss the formulation of these equations in the context of the 
following example: 

DO j=1,100 
  DO k=1,100 
    B(j,k) = A(j,k) + A(k,j) 

If i is an iteration vector in the iteration space of the loop, the alignment 
constraints require that the processor that performs iteration i must own 
$B(F_1 i)$, $A(F_1 i)$ and $A(F_2 i)$, where $F_1$ and $F_2$ are the following matrices: 

$$
F_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
F_2 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
$$

Let $C$, $D_A$ and $D_B$ be $p \times 2$ matrices representing the maps of the 
computation and arrays A and B to a p-dimensional processor template; p is an 
unknown which will be determined by our algorithm. Then, the alignment 
problem can be expressed as follows: find $C$, $D_A$ and $D_B$ such that 

$$
\forall i \in \text{iteration space of loop}: \quad
\begin{cases}
C i = D_B F_1 i \\
C i = D_A F_1 i \\
C i = D_A F_2 i
\end{cases}
$$

To ‘cancel’ the i on both sides of each equation, we will simplify the 
problem and require that the equations hold for all 2-dimensional integer 
vectors, regardless of whether they are in the bounds of the loop or not. 
In that case, the constraints simply become equations involving matrices, as 
follows: find $C$, $D_A$ and $D_B$ such that 

$$
\begin{aligned}
C &= D_B F_1 \\
C &= D_A F_1 \\
C &= D_A F_2
\end{aligned}
\tag{2.1}
$$

We will refer to the equation scheme C = DF as the fundamental equation 
of alignment. 

The general principle behind the formulation of alignment equations 
should be clear from this example. Each data reference for which alignment 
is desired gives rise to an alignment equation. Data references for which sub- 
scripts are not linear functions of loop indices are ignored; therefore, such 
references may give rise to communication at runtime. Although we have 
discussed only a single loop nest, it is clear that this framework of equa- 
tional constraints can be used for multiple loop nests as well. The equational 
constraints from each loop nest are combined to form a single system of simul- 
taneous equations, and the entire system is solved to find communication-free 
maps of computations and data. 



2.2 Reduction to Null Space Computation 

One way to solve systems of alignment equations is to set C and D matrices 
to the zero matrix of some dimension. This is the trivial solution in which all 
computations and data are mapped to a single processor, processor 0. This 
solution exploits no parallelism; therefore, we want to determine a non-trivial 
solution if it exists. We do this by reducing the problem to the standard linear 
algebra problem of determining a basis for the null space of a matrix. 
Consider a single equation. 

$$
C = DF
$$

This equation can be written in block matrix form as follows: 

$$
\begin{bmatrix} C & D \end{bmatrix} \begin{bmatrix} I \\ -F \end{bmatrix} = 0
$$

Now it is of the form UV = 0 where U is an unknown matrix and V is a 
known matrix. To see the connection with null spaces, we take the transpose 
of this equation and we see that this is the same as the equation $V^T U^T = 0$. 
Therefore, $U^T$ is a matrix whose columns are in the null space of $V^T$. To 
exploit parallelism, we would like the rank of $U^T$ to be as large as possible. 
Therefore, we must find a basis for the null space of matrix $V^T$. This is done 
using integer-preserving Gaussian elimination, a standard algorithm in the 
literature [4,12]. 

The same reduction works in the case of multiple constraints. Suppose 
that there are s loops and t arrays. Let the computation maps of the loops 
be $C_1, C_2, \ldots, C_s$, and the array maps be $D_1, D_2, \ldots, D_t$. We can construct 
a block row with all the unknowns as follows: 

$$
U = \begin{bmatrix} C_1 & C_2 & \cdots & C_s & D_1 & \cdots & D_t \end{bmatrix}
$$

For each constraint of the form $C_j = D_k F_\ell$, we create a block column 

$$
V_q = \begin{bmatrix} 0 \\ I \\ 0 \\ -F_\ell \\ 0 \end{bmatrix}
$$

where the zeros are placed so that: 

$$
U V_q = C_j - D_k F_\ell \tag{2.2}
$$



Putting all these block columns into a single matrix V, the problem of 
finding communication-free alignment reduces once again to a matrix 
equation of the form 

$$
UV = 0 \tag{2.3}
$$

Input: A set of alignment constraints of the form $C_j = D_k F_\ell$. 
Output: Communication-free alignment matrices $C_j$ and $D_k$. 

1. Assemble block columns as in (2.2). 

2. Put all block columns $V_q$ into one matrix V. 

3. Compute a basis for the null space of $V^T$. 

4. Set template dimension to number of rows of U. 

5. Extract the solution matrices $C_j$ and $D_k$ from U. 

6. Reduce the solution matrix U as described in Section 2.4. 

Fig. 2.1. Algorithm LINEAR-ALIGNMENT. 




For the example of Section 2.1, a solution matrix is 

$$
U = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{2.5}
$$

which gives us: 

$$
C = D_A = D_B = \begin{bmatrix} 1 & 1 \end{bmatrix} \tag{2.6}
$$

Since the number of rows of U is one, the solution requires a one-dimensional 
template. Iteration (i,j) is mapped to processor i + j. Arrays A and B are 
mapped identically so that the ‘anti-diagonals’ of these matrices are mapped 
to the same processor. 

The general algorithm is outlined in Figure 2.1. 
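
Steps 1 to 5 of the algorithm can be tried on the example of Section 2.1. The sketch below is an illustration only: it uses exact rational Gauss-Jordan elimination from the Python standard library rather than the integer-preserving elimination cited in the text, and the helper names (`null_space`, `block_column`, `neg`) are hypothetical. The unknowns are ordered as U = [C D_A D_B].

```python
from fractions import Fraction
from math import gcd

def null_space(M):
    """Integer basis of {x : M x = 0}, by Gauss-Jordan elimination over Q."""
    rows = [[Fraction(v) for v in row] for row in M]
    ncols = len(rows[0])
    pivots = []                                   # pivot column of each reduced row
    for c in range(ncols):
        r = len(pivots)
        piv = next((k for k in range(r, len(rows)) if rows[k][c] != 0), None)
        if piv is None:
            continue                              # column c is a free variable
        rows[r], rows[piv] = rows[piv], rows[r]
        pv = rows[r][c]
        rows[r] = [v / pv for v in rows[r]]
        for k in range(len(rows)):
            if k != r and rows[k][c] != 0:
                f = rows[k][c]
                rows[k] = [a - f * b for a, b in zip(rows[k], rows[r])]
        pivots.append(c)
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        x = [Fraction(0)] * ncols
        x[free] = Fraction(1)
        for r, pc in enumerate(pivots):
            x[pc] = -rows[r][free]
        scale = 1                                 # lcm of denominators
        for v in x:
            scale = scale * v.denominator // gcd(scale, v.denominator)
        basis.append([int(v * scale) for v in x])
    return basis

def neg(F):
    return [[-v for v in row] for row in F]

def block_column(blocks):
    """Stack 2x2 blocks vertically into one block column."""
    return [row for blk in blocks for row in blk]

I2 = [[1, 0], [0, 1]]
Z2 = [[0, 0], [0, 0]]
F1 = [[1, 0], [0, 1]]
F2 = [[0, 1], [1, 0]]

# One block column per constraint, rows ordered as [C; D_A; D_B].
cols = [
    block_column([I2, Z2, neg(F1)]),   # C = D_B F1
    block_column([I2, neg(F1), Z2]),   # C = D_A F1
    block_column([I2, neg(F2), Z2]),   # C = D_A F2
]

# V^T: every column of every block column becomes one row.
Vt = [[col[r][c] for r in range(6)] for col in cols for c in range(2)]

U = null_space(Vt)                     # rows of U span the solutions of UV = 0
C = [row[0:2] for row in U]
D_A = [row[2:4] for row in U]
D_B = [row[4:6] for row in U]
print(U, C, D_A, D_B)
```

The null space is one-dimensional here, so the template has one dimension and all three maps coincide.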



2.3 Remarks 

Our framework is robust enough that we can add additional constraints to 
computation and data maps without difficulty. For example, if a loop in a 
loop nest carries a dependence, we may not want to spread iterations of that 
loop across processors. More generally, dependence information can be char- 
acterized by a distance vector z, which for our purposes says that iterations i 
and i + z have to be executed on the same processor. In terms of our alignment 
model: 

$$
C i + b = C(i + z) + b \iff C z = 0 \tag{2.7}
$$

We can now easily incorporate (2.7) into our matrix system (2.3) by adding 
the following block column to V: 

$$
V_{dep} = \begin{bmatrix} 0 \\ z \\ 0 \end{bmatrix}
$$

where zeros are placed so that $U V_{dep} = Cz$. Adding this column to V will 
ensure that any two dependent iterations end up on the same processor. 
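
As a small illustration of the condition Cz = 0, the sketch below assumes a two-deep loop nest with a hypothetical distance vector z = (0, 1) and compares two candidate computation maps; the names `C_ok` and `C_bad` are made up for the example.

```python
def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

z = [0, 1]          # assumed distance vector: dependence carried by the inner loop
C_ok = [[1, 0]]     # maps iteration (i, j) to processor i
C_bad = [[1, 1]]    # maps iteration (i, j) to processor i + j

satisfies_ok = matvec(C_ok, z) == [0]     # C z = 0 holds
satisfies_bad = matvec(C_bad, z) == [0]   # C z = 0 fails

i = [3, 4]
i_plus_z = [a + b for a, b in zip(i, z)]
print(satisfies_ok, matvec(C_ok, i), matvec(C_ok, i_plus_z))
print(satisfies_bad, matvec(C_bad, i), matvec(C_bad, i_plus_z))
```

With `C_ok` the two dependent iterations land on the same processor; with `C_bad` they do not.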

In some circumstances, it may be necessary to align two data references 
without aligning them with any computation. This gives rise to equations 
of the form $D_1 F_1 = D_2 F_2$. Such equations can be incorporated into our 
framework by adding block columns of the form 

$$
V_p = \begin{bmatrix} 0 \\ F_1 \\ 0 \\ -F_2 \\ 0 \end{bmatrix}
$$

where the zeros are placed so that $U V_p = D_1 F_1 - D_2 F_2$. 



2.4 Reducing the Solution Basis 

Finally, one practical note. It is possible for Algorithm LINEAR-ALIGNMENT 
to produce a solution U which has p rows, even though all Cj produced by 
Step 5 have rank less than p. A simple example where this can happen is a 
program with two loop nests which have no data in common. Mapping the so- 
lution into a lower dimensional template can be left to the distribution phase 
of compiling; alternatively, an additional step can be added to Algorithm 
LINEAR-ALIGNMENT to solve this problem directly in the alignment phase. 
This modification is described next. 

Suppose we compute a solution which contains two computation align- 
ments: 

$$
U = \begin{bmatrix} C_1 & C_2 & \cdots \end{bmatrix} \tag{2.9}
$$

Let r be the number of rows in U. Let $r_1$ be the rank of $C_1$, and let $r_2$ be 
the rank of $C_2$. Assume that $r_1 < r_2$. We would like to have a solution basis 
where the first $r_1$ rows of $C_1$ are linearly independent, as are the first $r_2$ rows 
of $C_2$; that way, if we decide to have an $r_1$-dimensional template, we are 
guaranteed to keep $r_1$ degrees of parallelism for the second loop nest, as well. 

Mathematically, the problem is to find a sequence of row transformations 
T such that the first $r_1$ rows of $TC_1$ are linearly independent and so are the 
first $r_2$ rows of $TC_2$. 

A detailed procedure is given in the appendix. Here, we describe the 
intuitive idea. Suppose that we have already arranged the first $r_1$ rows of $C_1$ 
to be linearly independent. Inductively, assume that the first $k < r_2$ rows of 
$C_2$ are linearly independent as well. We want to make the (k+1)-st row of 
$C_2$ linearly independent of the previous k rows. If it already is, we go to the 
next row. If not, then there must be a row $m > k + 1$ of $C_2$ which is linearly 
independent of the first k rows. It is easy to see that if we add the m-th row 
to the (k+1)-st row, we will make the latter linearly independent of the first k 
rows. Notice that this can mess up $C_1$! Fortunately, it can be shown that if 
we add a suitably large multiple of the m-th row, we can be sure that the first 
$r_1$ rows of $C_1$ remain independent. This algorithm can be easily generalized 
to any number of $C_i$ blocks. 



3. Affine Alignment 

In this section, we generalize our framework to affine functions. The intuitive 
idea is to ‘encode’ affine subscripts as linear subscripts by using an extra 
dimension to handle the constant term. Then, we apply the machinery in 
Section 2 to obtain linear computation and data maps. The extra dimension 
can be removed from these linear maps to ‘decode’ them back into affine 
maps. 
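
The encoding idea can be illustrated on the subscript A(i+10, j+10) from Section 1. The sketch below uses hypothetical helper names (`affine_apply`, `encode`) and an arbitrary sample iteration; it shows that applying the affine function directly and applying the encoded linear map to the vector extended with a trailing 1 give the same result.

```python
def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def affine_apply(T, t, x):
    """Evaluate the affine function T x + t directly."""
    return [a + b for a, b in zip(matvec(T, x), t)]

def encode(T, t):
    """Build the block matrix [T t] that encodes the affine function."""
    return [row + [c] for row, c in zip(T, t)]

# Access function of A(i+10, j+10): linear part T, constant part t.
T = [[1, 0], [0, 1]]
t = [10, 10]
x = [3, 5]                                   # sample iteration (i, j)

direct = affine_apply(T, t, x)
encoded = matvec(encode(T, t), x + [1])      # extra dimension carries the constant
print(direct, encoded)
```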

We first generalize the data access functions so that they are affine 
functions of the loop indices. In the presence of such subscripts, aligning data 
and computation requires affine data and computation maps. Therefore, we 
introduce the following notation. 



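$$
\begin{aligned}
\text{Computation maps:} \quad & \mathcal{C}_j(i) = C_j i + c_j && (3.1) \\
\text{Data maps:} \quad & \mathcal{D}_k(a) = D_k a + d_k && (3.2) \\
\text{Data access functions:} \quad & \mathcal{F}_\ell(i) = F_\ell i + f_\ell && (3.3)
\end{aligned}
$$

$C_j$, $D_k$ and $F_\ell$ are matrices representing the linear parts of the affine 
functions, while $c_j$, $d_k$ and $f_\ell$ represent constants. The alignment constraints 
from each reference are now of the form 

$$
\forall i \in \mathbb{Z}^n: \quad C_j i + c_j = D_k (F_\ell i + f_\ell) + d_k \tag{3.4}
$$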

3.1 Encoding Affine Constraints as Linear Constraints 

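Affine functions can be encoded as linear functions by using the following 
identity. 

$$
Tx + t = \begin{bmatrix} T & t \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} \tag{3.5}
$$

where T is a matrix, and t and x are vectors. We can put (3.4) in the form: 

$$
\forall i \in \mathbb{Z}^n: \quad
\begin{bmatrix} C_j & c_j \end{bmatrix} \begin{bmatrix} i \\ 1 \end{bmatrix}
= \begin{bmatrix} D_k & d_k \end{bmatrix}
\begin{bmatrix} F_\ell & f_\ell \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} i \\ 1 \end{bmatrix} \tag{3.6}
$$

Now we let: 

$$
\hat{C}_j = \begin{bmatrix} C_j & c_j \end{bmatrix}, \qquad
\hat{D}_k = \begin{bmatrix} D_k & d_k \end{bmatrix}, \qquad
\hat{F}_\ell = \begin{bmatrix} F_\ell & f_\ell \\ 0 & 1 \end{bmatrix} \tag{3.7}
$$

(3.6) can be written as 

$$
\forall i \in \mathbb{Z}^n: \quad
\hat{C}_j \begin{bmatrix} i \\ 1 \end{bmatrix} = \hat{D}_k \hat{F}_\ell \begin{bmatrix} i \\ 1 \end{bmatrix} \tag{3.8}
$$

As before, we would like to ‘cancel’ the vector $\begin{bmatrix} i \\ 1 \end{bmatrix}$ from both sides of 
the equation. To do this, we need the following result. 

Lemma 3.1. Let T be a matrix, t a vector. Then 

$$
\forall x: \ \begin{bmatrix} T & t \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = 0
\iff \begin{bmatrix} T & t \end{bmatrix} = 0
$$

Proof: In particular, we can let x = 0. This gives us: 

$$
\begin{bmatrix} T & t \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = t = 0
$$

Therefore, for every x: 

$$
\begin{bmatrix} T & t \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}
= \begin{bmatrix} T & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = Tx = 0
$$

which means that T = 0, as well. □ 

Using Lemma 3.1, we can rewrite (3.8) as follows: 

$$
\hat{C}_j = \hat{D}_k \hat{F}_\ell \tag{3.9}
$$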

We can now use the techniques in Section 2 to reduce systems of such 
equations to a single matrix equation as follows: 

UV = 0 (3.10) 

In turn, this equation can be solved using the Algorithm LINEAR-ALIGN- 
MENT to determine U. To illustrate this process, we use the example from 
Section 1: 

DO i=1,N 
  DO j=1,N 
    Y(i) = Y(i) + A(i+10,j+10)*X(j) 

Suppose we wish to satisfy the constraints for Y and A. The relevant array 
access functions are: 



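$$
\begin{aligned}
F_A &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, &
f_A &= \begin{bmatrix} 10 \\ 10 \end{bmatrix}, &
\hat{F}_A &= \begin{bmatrix} 1 & 0 & 10 \\ 0 & 1 & 10 \\ 0 & 0 & 1 \end{bmatrix} \\
F_Y &= \begin{bmatrix} 1 & 0 \end{bmatrix}, &
f_Y &= \begin{bmatrix} 0 \end{bmatrix}, &
\hat{F}_Y &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\end{aligned}
\tag{3.11}
$$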



The reader can verify that the matrix equation to be solved is the following 
one. 



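$$
\hat{U} V = 0 \tag{3.12}
$$

where: 

$$
\hat{U} = \begin{bmatrix} \hat{C} & \hat{D}_A & \hat{D}_Y \end{bmatrix}, \qquad
V = \begin{bmatrix} I & I \\ -\hat{F}_A & 0 \\ 0 & -\hat{F}_Y \end{bmatrix}
$$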

And the solution is the following matrix. 

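$$
\hat{U} = \left[ \begin{array}{ccc|ccc|cc}
1 & 0 & 0 & 1 & 0 & -10 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1
\end{array} \right] \tag{3.13}
$$

From this matrix, we can read off the following maps of computation and 
data. 

$$
\hat{C} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
\hat{D}_A = \begin{bmatrix} 1 & 0 & -10 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
\hat{D}_Y = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$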



This says that iteration i of the loop and element X (i) are mapped to the 
following virtual processor. 




Notice that although the space of virtual processors has two dimensions 
(because of the encoding of constants) , the maps of the computation and data 
use only a one-dimensional subspace of the virtual processor space. To obtain 
a clean solution, it is desirable to remove the extra dimension introduced by 
the encoding. 

Input: A set of alignment constraints as in Equation (3.4). 

Output: Communication-free alignment mappings characterized by $\hat{C}_j$ and $\hat{D}_k$. 

1. Assemble $\hat{F}_\ell$ matrices as in Equation (3.6). 

2. Assemble block columns $V_q$ as in Equation (2.2) using $\hat{F}_\ell$ instead 
of $F_\ell$. 

3. Put all block columns $V_q$ into one matrix V. 

4. Compute a basis $\hat{U}$ for the null space of $V^T$ as in Step 3 of the 
LINEAR-ALIGNMENT algorithm. 

5. Eliminate redundant row(s) in $\hat{U}$. 

6. Extract the solution matrices from $\hat{U}$. 



Fig. 3.1. Algorithm AFFINE-ALIGNMENT. 



We have already mentioned that there is always a trivial solution that 
maps everything to the same virtual processor p = 0. Because we have in- 
troduced affine functions, it is now possible to map everything to the same 
virtual processor $p \neq 0$. In our framework it is reflected in the fact that there 
is always a row 

$$
w^T = \left[ \begin{array}{cccc|cccc|c|cccc}
0 & \cdots & 0 & 1 & 0 & \cdots & 0 & 1 & \cdots & 0 & \cdots & 0 & 1
\end{array} \right] \tag{3.14}
$$

(with zeros placed appropriately) in the row space of the solution matrix $\hat{U}$. 
To “clean up” the solution notice that we can always find a vector x such 
that $x^T \hat{U} = w^T$. Moreover, let k be the position of some non-zero element 
in x and let J be an identity matrix with the k-th row replaced by $x^T$ (J is 
non-singular). Then the k-th row of $\hat{U}' = J\hat{U}$ is equal to $w^T$ and is linearly 
independent from the rest of the rows. This means that we can safely remove 
it from the solution matrix. Notice that this procedure is exactly equivalent 
to removing the k-th row from $\hat{U}$. A more detailed description is given in the 
appendix. 

Algorithm AFFINE-ALIGNMENT is summarized in Figure 3.1. 
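
The worked example of this section can be checked numerically. The sketch below assumes that the solution matrix (3.13) reads [1 0 0 | 1 0 -10 | 1 0] in its first row and [0 0 1 | 0 0 1 | 0 1] in its second row, with the column blocks belonging to the computation, array A and array Y respectively; the variable names are hypothetical. It verifies both encoded constraints and then drops the redundant second row to decode the affine maps.

```python
def matmul(A, B):
    """Matrix product on plain Python lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Encoded access functions for A(i+10, j+10) and Y(i).
F_A_hat = [[1, 0, 10], [0, 1, 10], [0, 0, 1]]
F_Y_hat = [[1, 0, 0], [0, 0, 1]]

# Solution matrix as read from (3.13) (an assumption, see the lead-in).
U_hat = [[1, 0, 0, 1, 0, -10, 1, 0],
         [0, 0, 1, 0, 0, 1, 0, 1]]
C_hat = [row[0:3] for row in U_hat]
D_A_hat = [row[3:6] for row in U_hat]
D_Y_hat = [row[6:8] for row in U_hat]

# Both encoded alignment constraints must hold exactly.
ok_A = matmul(D_A_hat, F_A_hat) == C_hat
ok_Y = matmul(D_Y_hat, F_Y_hat) == C_hat

# The second row only encodes the constant dimension, so drop it.
C_aff = C_hat[0]        # coefficients of (i, j, 1)
D_A_aff = D_A_hat[0]    # coefficients of (a, b, 1)
D_Y_aff = D_Y_hat[0]    # coefficients of (i, 1)
print(ok_A, ok_Y, C_aff, D_A_aff, D_Y_aff)
```

Read as affine maps, the remaining rows say that iteration (i, j) runs on processor i, A(a, b) lives on processor a - 10, and Y(i) lives on processor i, which matches the solution found by inspection in Section 1.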



4. Replication 

As we discussed in Section 1, communication-free alignment may require 
replication of data. Currently, we allow replication only of read-only arrays 
or of the arrays which are updated using reduction operations. In this section, 
we show how replication of data is handled in our linear algebra framework. 
We use a matrix-vector multiplication loop (MVM) as a running example. 

DO i=1,N 
  DO j=1,N 
    Y(i) = Y(i) + A(i,j)*X(j) 

We are interested in deriving the parallel version of this code which uses 2-D 
alignment; that is, it uses a 2-dimensional template in which processor (i, j) 
performs iteration (i, j). If we keep the alignment constraint for A only, we 
get the solution: 

$$
C = D_A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{4.1}
$$

which means that iteration (i,j) is executed on the processor with 
coordinates (i, j). This processor also owns the array element A(i,j). For the 
computation, it needs X(j) and Y(i). This requires that X be replicated 
along the i dimension of the processor grid, and Y be reduced along the j 
dimension. We would like to derive this information automatically. 



4.1 Formulation of Replication 

To handle replication, we associate a pair of matrices R and D with each data 
reference for which alignment is desired; as we show next, the fundamental 
equational scheme for alignment becomes RC = DF. 

Up to this point, data alignment was specified using a matrix D which 
mapped array element a to logical processor Da. If D has a non-trivial null- 
space, then elements of the array belonging to the same coset of the null-space 
get placed onto the same virtual processor; that is, 

$$
D a_1 = D a_2 \iff a_1 - a_2 \in \mathrm{null}(D)
$$

When we allow replication, the mapping of array elements to processors 
can be described as follows. Array element a is mapped to processor p if 

Rp = Da 

The mapping of the array is now a many-to-many relation that can be de- 
scribed in words as follows: 

— Array elements that belong to the same coset of null(D) are mapped onto 
the same processors. 

— Processors that belong to the same coset of null(R) own the same data. 

From this, it is easy to see that the fundamental equation of alignment 
becomes RC = DF. The replication-free scenario is just a special case when 
R is I. Not all arrays in a procedure need to be replicated; for example, 
if an array is involved in a non-reduction dependence or it is very large, we 
can disallow replication of that array. Notice that the equation RC = DF 
is non-linear if both R and C are unknown. To make the solution tractable, 
we first compute C based on the constraints for the non-replicated arrays. 
Once C is determined, the equation is again linear in the unknowns R and 
D. Intuitively, this means that we first drop some constraints from the 
non-replicated alignment system, and then try to satisfy these constraints via 
replication. 

We need to clarify what “fixing C” means. When we solve the alignment 
system (2.3), we obtain a basis Cbasis for all solutions to the loop alignment. 
The solutions can be expressed parametrically as 

C = TCbasis (4.2) 

for any matrix T. Now the replication equation becomes 

RTCbasis = DF (4.3) 

and we are faced again with a non-linear system (T is another unknown)! 
The key observation is that if we are considering a single loop nest, then 
T becomes redundant since we can “fold it” into R. This lets us solve the 
replication problem for a single loop nest. 

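In our MVM example, once the loop alignment has been fixed as in (4.1), 
the system of equations for the replication of X and Y is: 

$$
\begin{aligned}
R_X C &= D_X F_X \\
R_Y C &= D_Y F_Y
\end{aligned}
$$

These can be solved independently or put together into a block-matrix form 
UV = 0: 

$$
U = \begin{bmatrix} R_X & R_Y & D_X & D_Y \end{bmatrix}, \qquad
V = \begin{bmatrix} C & 0 \\ 0 & C \\ -F_X & 0 \\ 0 & -F_Y \end{bmatrix} \tag{4.4}
$$

and solved using the standard methods. The solution to this system is 

$$
\begin{aligned}
R_X &= \begin{bmatrix} 0 & 1 \end{bmatrix}, & D_X &= \begin{bmatrix} 1 \end{bmatrix} \\
R_Y &= \begin{bmatrix} 1 & 0 \end{bmatrix}, & D_Y &= \begin{bmatrix} 1 \end{bmatrix}
\end{aligned}
\tag{4.5}
$$

which is the desired result: columns of the processor grid form the cosets of 
null($R_X$) and rows of the processor grid form the cosets of null($R_Y$). 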
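
The ownership relation Rp = Da can be enumerated directly on a small grid. The sketch below assumes the replication solution R_X = [0 1], R_Y = [1 0], D_X = D_Y = [1] (which satisfies RC = DF for C = I, F_X = [0 1] and F_Y = [1 0]), a hypothetical 3 x 3 processor grid with 0-based coordinates, and a made-up helper `owners`.

```python
def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def owners(R, D, a, grid):
    """All processors p in the grid that hold array element a, i.e. R p = D a."""
    return [p for p in grid if matvec(R, list(p)) == matvec(D, a)]

P = 3                                        # assumed grid extent per dimension
grid = [(i, j) for i in range(P) for j in range(P)]

R_X, D_X = [[0, 1]], [[1]]                   # replication pair for X
R_Y, D_Y = [[1, 0]], [[1]]                   # replication pair for Y

x1_owners = owners(R_X, D_X, [1], grid)      # processors holding X(1)
y2_owners = owners(R_Y, D_Y, [2], grid)      # processors contributing to Y(2)
print(x1_owners)
print(y2_owners)
```

X(1) is held by every processor whose second coordinate is 1, so X is replicated along the i dimension; Y(2) is associated with every processor whose first coordinate is 2, which is the set over which the reduction runs.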

The overall Algorithm SINGLE-LOOP-REPLICATION-ALIGNMENT is summa- 
rized in Figure 4.1. 



5. Heuristics 

In practice, systems of alignment constraints are usually over-determined, so 
it is necessary to drop one or more constraints to obtain parallel execution. 
As we mentioned in the introduction, it is very difficult to determine which 
constraints must be dropped to obtain an optimal solution. In this section, 
we discuss our heuristic which is motivated by scalability analysis of common 
computational kernels. 

Input: Replication constraints of the form RC = DF. 

Output: Matrices R, D and $C_{basis}$ that specify alignment with 
replication. 

1. Find $C_{basis}$ by solving the alignment system for the 
non-replicated arrays using the Algorithm AFFINE-ALIGNMENT. If all 
arrays in the loop nest are allowed to be replicated, then set 
$C_{basis} = I$. 

2. Find (R, D) pairs that specify replication by solving the 
$R C_{basis} = DF$ equations. 



Fig. 4.1. Algorithm SINGLE-LOOP-REPLICATION-ALIGNMENT. 



5.1 Lessons from Some Common Computational Kernels 

We motivate our ideas by the following example. Consider a loop nest that 
computes matrix-matrix product: 

DO i=1,n 
  DO j=1,n 
    DO k=1,n 
      C(i,j) = C(i,j) + A(i,k)*B(k,j) 

Reference [9] describes various parallel algorithms for matrix-matrix 
multiplication. It is shown that the best scalability is achieved by an algorithm 
which organizes the processors into a 3-D grid. Let p, q and r be the processor 
indices in the grid. Initially, A is partitioned in 2-D blocks along the p-r “side” 
of the grid. That is, if we let $A^{pr}$ be a block of A, then it is initially placed on 
the processor with the coordinates (p, 0, r). Similarly, each block $B^{rq}$ is placed 
on processor (0, q, r). Our goal is to accumulate the block $C^{pq}$ of the result 
on the processor (p, q, 0). 

At the start of the computation, A is replicated along the second (q) 
dimension of the grid. B is replicated along the first dimension (p). Therefore, 
we end up with processor (p, q, r) holding a copy of $A^{pr}$ and $B^{rq}$. Then each 
processor computes the local matrix-matrix product: 

$$
D^{pqr} = A^{pr} \cdot B^{rq} \tag{5.1}
$$

It is easy to see that the blocks of C are related to these local products by: 

$$
C^{pq} = \sum_r D^{pqr} \tag{5.2}
$$

Therefore, after the local products are computed, they are reduced along the 
r dimension of the grid. 
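
The data flow of the 3-D algorithm can be simulated in a few lines. The sketch below is a sequential illustration only, assuming 1 x 1 blocks (so each processor (p, q, r) holds one element of A and one of B) and arbitrary test matrices; the dictionary names are hypothetical.

```python
n = 3   # assumed grid extent; with 1x1 blocks this is also the matrix size

# Arbitrary integer test matrices.
A = [[p + 2 * r + 1 for r in range(n)] for p in range(n)]
B = [[3 * r - q for q in range(n)] for r in range(n)]

coords = [(p, q, r) for p in range(n) for q in range(n) for r in range(n)]

# After replication: A(p, r) is copied along q, B(r, q) is copied along p.
held = {(p, q, r): (A[p][r], B[r][q]) for (p, q, r) in coords}

# Each processor computes its local product.
local = {pqr: a * b for pqr, (a, b) in held.items()}

# Reduction along the r dimension accumulates C(p, q).
C = [[sum(local[(p, q, r)] for r in range(n)) for q in range(n)] for p in range(n)]

# Sequential reference: the original triple loop.
reference = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(C == reference)
```

Every local product is independent of the others, so the only communication is the initial replication of A and B and the final reduction along r.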

We can describe this computation using our algebraic framework. There 
is a 3-D template and the computation alignment is an identity. Each of the 
arrays is replicated. For example the values of D and R for the array A are: 

$$
R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
D = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{5.3}
$$



By collapsing different dimensions of the 3-D grid, we get 2-D and 1-D 
versions of this code. In general, it is difficult for a compiler to determine 
which version to use — the optimal solution depends on the size of the ma- 
trix, and on the overhead of communication relative to computation of the 
parallel machine [9]. On modern machines where the communication over- 
head is relatively small, the 3-D algorithm is preferable, but most alignment 
heuristics we have seen would not produce this solution — note that all ar- 
rays are communicated in this version! These heuristics are much more likely 
to “settle” for the 2-D or 1-D versions, with some of the arrays kept local. 

Similar considerations apply to other codes such as matrix-vector prod- 
uct, 2-D and 3-D stencil computations, and matrix factorization codes [9]. 
Consider stencil computations. Here is a typical example: 



DO i=1,N 
  DO j=1,N 
    A(i,j) = ...B(i-1,j)...B(i+1,j)... 
             ...B(i,j)...B(i,j-1)...B(i,j+1) 

In general, stencil computations are characterized by array access functions 
of the form $F i + f_k$, where the linear part F is the same for most of the 
accesses. The difference in the offset induces nearest-neighbor communication. 
We will analyze the communication/computation cost ratio for 1-D and 2-D 
partitioning of this example. For the 1-D case, the N-by-N iteration space 
is cut up into N/P-by-N blocks. If N is large enough, then each processor 
has to communicate with its “left” and “right” neighbors, and the volume of 
communication is 2N. We can assume that the communication between the 
different pairs of neighbors happens at the same time. Therefore, the total 
communication time is $O(2N)$. The computation done on each processor is 
$O(N^2/P)$, so the ratio of communication to computation is $O(P/N)$. In the 
2-D case, the iteration space is cut up into $N/\sqrt{P}$-by-$N/\sqrt{P}$ blocks. Each 
processor now has four neighbors to communicate with, and the volume of 
communication is $4N/\sqrt{P}$. Therefore, the ratio for this case is $O(\sqrt{P}/N)$. We 
conclude that the 2-D case scales better than the 1-D case¹. In general, if we have a 
d-dimensional stencil-like computation, then it pays to have a d-dimensional 
template. 
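
The two ratios can be compared numerically. The sketch below simply evaluates the volume and work expressions from the analysis above for an assumed problem size and processor count (N = 1024, P = 64); the function names are hypothetical and unit costs per element are assumed.

```python
from math import sqrt

def stencil_ratio_1d(N, P):
    """Communication/computation ratio for N/P-by-N blocks."""
    comm = 2 * N                 # exchange with left and right neighbors
    comp = N * N / P             # work per processor
    return comm / comp

def stencil_ratio_2d(N, P):
    """Communication/computation ratio for N/sqrt(P)-by-N/sqrt(P) blocks."""
    comm = 4 * N / sqrt(P)       # exchange with four neighbors
    comp = N * N / P
    return comm / comp

N, P = 1024, 64                  # assumed sizes for illustration
r1 = stencil_ratio_1d(N, P)
r2 = stencil_ratio_2d(N, P)
print(r1, r2)
```

For these sizes the 2-D ratio is four times smaller than the 1-D ratio, and the gap widens as P grows.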

The situation is somewhat different in matrix and vector products and 
matrix factorization codes (although the final result is the same). Let us 
consider matrix-vector product together with some vector operation between 
X and Y : 

¹ In fact, the total volume of communication is smaller in the 2-D case, despite the 
fact that we had fewer alignment constraints satisfied (this paradoxical result 
arises from the fact that the amount of communication is a function not just of 
alignment but of distribution as well). 

DO i=1,N 
  DO j=1,N 
    Y(i) = Y(i) + A(i,j) * X(j) 

DO i=1,N 
  X(i) = ...Y(i)... 

This fragment is typical of many iterative linear system solvers ([11]). One 
option is to use a 1-D template by leaving the constraint for X in the 
matrix-vector product loop unsatisfied. The required communication is an all-to-all 
broadcast of the elements of X. The communication cost is $O(N \log(P))$. The 
computation cost is $O(N^2/P)$. This gives us a communication to computation 
ratio of $O(\log(P) P/N)$. 

In the 2-D version, each processor gets an $N/\sqrt{P}$-by-$N/\sqrt{P}$ block of the
iteration space and A. X and Y are partitioned in $\sqrt{P}$ pieces placed along
the diagonal of the processor grid [9]. The algorithm is somewhat similar
to matrix-matrix multiplication: each block of X gets broadcast along the
column dimension, each block of Y is computed as the sum-reduction along
the row dimension. Note that because each broadcast or reduction happens in
parallel, the communication cost is $O(\log(\sqrt{P})\,N/\sqrt{P}) = O(\log(P)\,N/\sqrt{P})$.
This results in a communication to computation ratio of $O(\log(P)\sqrt{P}/N)$.
Although the total volume of communication is roughly the same for the 1-D
and 2-D case, the cost is asymptotically smaller in the 2-D case. Intuitively,
the reason is that we were able to parallelize communication itself.
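As a rough illustration of the asymptotic claim, the sketch below evaluates both ratios for the matrix-vector product. It assumes base-2 logarithms and drops all constant factors, so only the trend is meaningful; the function names are hypothetical.

```python
import math


def matvec_ratio_1d(n: int, p: int) -> float:
    """All-to-all broadcast of X: cost N*log(P) against N^2/P computation."""
    return (n * math.log2(p)) / (n * n / p)


def matvec_ratio_2d(n: int, p: int) -> float:
    """Broadcasts and reductions of N/sqrt(P)-sized blocks run in parallel."""
    return (math.log2(p) * n / math.sqrt(p)) / (n * n / p)


if __name__ == "__main__":
    for p in (16, 64, 256):
        print(p, matvec_ratio_1d(4096, p), matvec_ratio_2d(4096, p))
```

The quotient of the two ratios is sqrt(P), which is the gain obtained from parallelizing the communication itself.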

To reason about this in our framework, let us focus on matrix-vector 
product, and see what kind of replication for X we get for the 1-D and 2-D 
case. In the 1-D case, the computation alignment is: 



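$$C = \begin{bmatrix} 1 & 0 \end{bmatrix} \tag{5.4}$$

The replication equation RC = DF for X is:

$$R_X \begin{bmatrix} 1 & 0 \end{bmatrix} = D_X \begin{bmatrix} 0 & 1 \end{bmatrix} \tag{5.5}$$

The only solution is:

$$R_X = D_X = \begin{bmatrix} 0 \end{bmatrix} \tag{5.6}$$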



This means that every processor gets all elements of X, i.e., it is an
all-to-all broadcast. We have already computed the alignments for the 2-D case
in Section 4. Because $R_X$ has rank 1, we have "parallelizable" broadcasts:
the broadcasts along different dimensions of the processor grid
can happen simultaneously. In general, if the replication matrix has rank
r and the template has dimension d, then we have broadcasts along
(d − r)-dimensional subspaces of the template. The larger r, the more of these
broadcasts happen at the same time. In the extreme case r = d we have a
replication-free alignment, which requires no communication at all.
5.2 Implications for Alignment Heuristic

The above discussion suggests the following heuristic strategy. 

— If a number of constraints differ only in the offset of the array access 
function, use only one of them. 

— If there is a d-dimensional DOALL loop (or loop with reductions), use a 
d-dimensional template for it and try to satisfy conflicting constraints via 
replication. Keep the d-dimensional template if the rank of the resulting 
replication matrices is greater than zero. 

— If the above strategy fails, use a greedy strategy based on array dimensions 
as a cost measure. That is, try to satisfy the alignment constraints for the 
largest array first (intuitively, we would like large arrays to be “locked in 
place” during the computation) . This is the strategy used by Feautrier [5] . 

Intuitively, this heuristic is biased in favor of exploiting parallelism in 
DO-ALL loops, since communication can be performed in parallel before 
the computation starts. This is true even if there are reductions in the loop 
nest, because the communication required to perform reductions can also be 
parallelized. This bias in favour of exploiting parallelism in DO-ALL loops 
at the expense of communication is justified on modern machines. 



6. Conclusion 

We have presented a simple framework for the solution of the alignment 
problem. This framework is based on linear algebra, and it permits the de- 
velopment of simple and fast algorithms for a variety of problems that arise 
in alignment. 
A. Reducing the Solution Matrix



As we mentioned in Section 2.4, it is possible for our solution procedure 
to produce a matrix U which has more rows than the rank of any of the 
computation alignments $C_i$. Intuitively, this means that we end up with a
template that has a larger dimension than can be exploited in any loop nest 
in the program. Although the extra dimensions can be ‘folded’ away during 
the distribution phase, we show how the problem can be eliminated by adding 
an extra step to our alignment procedure. First, we discuss two ways in which 
this problem can arise. 



A.1 Unrelated Constraints

Suppose we have two loops with iteration alignments $C_1$ and $C_2$ and two
arrays A and B with data alignments $D_A$ and $D_B$. Furthermore, only A is
accessed in loop 1 via access function $F_A$ and only B is accessed in loop 2
via access function $F_B$^. The alignment equations in this case are:

$$C_1 = D_A F_A \tag{A.1}$$

$$C_2 = D_B F_B \tag{A.2}$$

We can assemble this into a combined matrix equation:

$$UV = \begin{bmatrix} C_1 & C_2 & D_A & D_B \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & I \\ -F_A & 0 \\ 0 & -F_B \end{bmatrix} = 0 \tag{A.3}$$

Say $C_1^*$ and $D_A^*$ are the solution to (A.1), and $C_2^*$ and $D_B^*$ are the solution
to (A.2). Then it is not hard to see that the following matrix is a solution to
(A.3):

$$U = \begin{bmatrix} C_1^* & 0 & D_A^* & 0 \\ 0 & C_2^* & 0 & D_B^* \end{bmatrix} \tag{A.4}$$

So we have obtained a processor space whose dimension is the sum
of the dimensions allowed by (A.1) (say, $p_1$) and (A.2) (say, $p_2$). However,
these dimensions are not fully utilized since only the first $p_1$ dimensions are
used in loop 1, and only the remaining $p_2$ dimensions are used in loop 2.



^ For simplicity we are considering linear alignments and subscripts. For affine 
alignments and subscripts the argument is exactly the same after the appropri- 
ate encoding. 
This problem is relatively easy to solve. In general, we can model the
alignment constraints as an undirected alignment constraint graph G whose
vertices are the unknown D and C alignment matrices; an edge (x, y)
represents an alignment equation constraining vertex x and vertex y.
We solve the constraints in each connected component separately, and choose
a template with dimension equal to the maximum of the dimensions required
for the connected components.
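The component-wise strategy can be sketched as follows. This is only an illustration: the per-component solver is passed in as a hypothetical callback `solve_dim`, standing in for the alignment procedure of the earlier sections, and the vertex names are placeholders.

```python
from collections import defaultdict


def components(vertices, edges):
    """Connected components of the alignment constraint graph (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for x, y in edges:
        parent[find(x)] = find(y)

    groups = defaultdict(list)
    for v in vertices:
        groups[find(v)].append(v)
    return sorted(sorted(g) for g in groups.values())


def template_dimension(vertices, edges, solve_dim):
    """Maximum, over components, of the dimension each component requires."""
    return max(solve_dim(c) for c in components(vertices, edges))


if __name__ == "__main__":
    # The example of Section A.1: loop 1 touches only A, loop 2 touches only B.
    verts = ["C1", "C2", "DA", "DB"]
    edges = [("C1", "DA"), ("C2", "DB")]
    # Hypothetical per-component dimensions p1 = 2 and p2 = 1.
    dims = {("C1", "DA"): 2, ("C2", "DB"): 1}
    print(template_dimension(verts, edges, lambda c: dims[tuple(c)]))
```

With the assumed dimensions the template has dimension 2, rather than the 3 that the combined solution (A.4) would yield.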



A.2 General Procedure

Unfortunately, extra dimensions can arise even when there is only one
component in the alignment constraint graph. Consider the following program
fragment:

DO i=1,n
  DO j=1,n
    ...A(i,0,j)...

The alignment equation for this loop is:

$$C = D_A \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}$$

The solution is:

$$U = \begin{bmatrix} C & D_A \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

So we have rank(C) = 2 < rank(U). If we use this solution, we end up placing
the unused dimension of A onto an extra dimension of virtual processor space.
We need a way of modifying the solution matrix U so that:

$$\mathrm{rank}(U) = \max_i \{\mathrm{rank}(C_i)\} \tag{A.5}$$

For this, we apply elementary (unimodular) row operations ^ to U so that
we end up with a matrix U' in which the first $\mathrm{rank}(C_i)$ rows of each
component $C_i$ form a row basis for the rows of this component. We will say that
each component of U' is reduced. By taking the first $\max_i \{\mathrm{rank}(C_i)\}$ rows of
U' we obtain a desired solution W.

In our example matrix U is not reduced: the first two rows of C do not
form a basis for all rows of C. But if we add the third row of U to the second
row, we get U' with the desired property:

^ Multiplying a row by ±1 and adding a multiple of one row to another are 
elementary row operations. 
$$U' = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

Now by taking the first two rows of this matrix we obtain a solution that does not
induce unused processor dimensions.
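The row operation in this example can be checked mechanically. The sketch below uses exact rational arithmetic from the standard library; the matrices are those of the A(i,0,j) example, while the helper `rank` and the variable names are illustrative choices.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a small matrix by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][col] / m[rk][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


# U = [C | D_A] for the access A(i,0,j); C has 2 columns, D_A has 3.
U = [[1, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 1, 0, 0, 1]]
F_A = [[1, 0], [0, 0], [0, 1]]

# Add the third row to the second row, then keep the first two rows.
U_reduced = [U[0], [a + b for a, b in zip(U[1], U[2])], U[2]]
W = U_reduced[:2]


def satisfies_alignment(row):
    """Check C = D_A * F_A for one row of [C | D_A]."""
    c, d = row[:2], row[2:]
    return c == [sum(d[k] * F_A[k][j] for k in range(3)) for j in range(2)]


if __name__ == "__main__":
    print(rank([r[:2] for r in U]), rank(U))   # rank(C), rank(U)
    print(rank([r[:2] for r in W]), all(satisfies_alignment(r) for r in W))
```

The first two rows of C in the original U have rank 1, whereas the first two rows of the C component of the reduced matrix have rank 2, and both retained rows still satisfy the alignment equation.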

The question now becomes: how do we systematically choose a sequence
of row operations on U in order to reduce its components? Without loss of
generality, let us assume that U consists only of components:

$$U = \begin{bmatrix} C_1 & C_2 & \dots & C_s \end{bmatrix} \tag{A.6}$$

Let:

— q be the number of rows in U. Also, by construction of U, q = rank(U).

— $r_i$ be the rank of $C_i$ for i = 1, ..., s.

— $r_{max} = \max_i \{\mathrm{rank}(C_i)\}$. Notice that $r_{max} \le q$, in general.

We want to find a matrix W, so that:

— the number of rows in W equals $r_{max}$.

— each component of W has the same rank as the corresponding component
of U.

Here is the outline of our algorithm:

1. Perform elementary row operations on U to get U' in which every
component is reduced.

2. Set W to the first $r_{max}$ rows of U'.

The details are filled in below.

We need the following Lemma:

Lemma A1. Let $a_1, \dots, a_r, a_{r+1}, \dots, a_n$ be some vectors. Furthermore
assume that the first r vectors form a basis for the span of $a_1, \dots, a_n$. Let:

$$a_k = \sum_{j=1}^{r} \beta_j a_j \tag{A.7}$$

be the representation of $a_k$ in the basis above. Then the vectors $a_1, \dots, a_{r-1},
a_r + \alpha a_k$ are linearly independent (and form a basis) if and only if:

$$1 + \alpha \beta_r \ne 0$$

Proof:

$$a_r + \alpha a_k = a_r + \alpha \sum_{j=1}^{r} \beta_j a_j \tag{A.8}$$

$$= a_r (1 + \alpha \beta_r) + \alpha \sum_{j=1}^{r-1} \beta_j a_j \tag{A.9}$$

Now if in equation (A.9) $(1 + \alpha\beta_r) = 0$, then the vectors $a_1, \dots, a_{r-1}, a_r +
\alpha a_k$ are linearly dependent. Vice versa, if $(1 + \alpha\beta_r) \ne 0$, then these vectors
are independent by the assumption on the original first r vectors. □

Lemma A1 forms a basis for an inductive algorithm to reduce all
components of U. Inductively assume that we have already reduced $C_1, \dots, C_{k-1}$.
Below we show how to reduce $C_k$, while keeping the first k − 1 components
reduced.

Let

$$C_j = \begin{bmatrix} a_1^T \\ \vdots \\ a_q^T \end{bmatrix}$$

we want the first $r_j$ rows to be linearly independent. Assume inductively that
the first i − 1 rows ($i < r_j$) are already linearly independent. There are two
cases for the i-th row ($a_i^T$):

1. $a_i$ is linearly independent from the previous rows. In this case we just
move to the next row.

2. $a_i = \sum_{l=1}^{i-1} \gamma_l a_l$, i.e. $a_i$ is linearly dependent on the previous rows.

Note that since $r_j = \mathrm{rank}(C_j) > i$, there is a row $a_p$ which is
linearly independent from the first i − 1 rows. Because of this the rows
$a_1, \dots, a_{i-1}, a_i + \alpha a_p$ are linearly independent for any $\alpha \ne 0$.

Lemma A1 tells us that we can choose $\alpha$ so that the previous components
are kept reduced. We have to solve a system of inequalities like:

$$\begin{aligned} 1 + \alpha \beta_r^{(1)} &\ne 0 \\ &\;\;\vdots \\ 1 + \alpha \beta_r^{(k-1)} &\ne 0 \end{aligned} \tag{A.10}$$

where $\beta_r^{(1)}, \dots, \beta_r^{(k-1)}$ come from (A.8) for each component.
The $\beta_r^{(l)}$ are rational numbers: $\beta_r^{(l)} = \eta_l / \mu_l$. So we have to solve a
system of inequalities:

$$\begin{aligned} \alpha \eta_1 &\ne -\mu_1 \\ \alpha \eta_2 &\ne -\mu_2 \\ &\;\;\vdots \end{aligned} \tag{A.11}$$

It is easy to see that $\alpha = \max_l |\mu_l| + 1$ is a solution.

The full algorithm for communication-free alignment ALIGNMENT-WITH-FIXUP
is outlined in Figure A.1.
Input: A set of encoded affine alignment constraints as in Equation
(3.4).

Output: Communication-free alignment mappings characterized by
$C_j$, $c_j$, $D_k$, $d_k$ which do not induce unused processor dimensions.

1. Form alignment constraint graph G.

2. For each connected component of G:

a) Assemble the system of constraints and solve it as described
in Algorithm AFFINE-ALIGNMENT to get the solution matrix
U.

b) Remove the extra row of U that was induced by affine
encoding (Section B).

c) If necessary, apply the procedure described in Section A.2 to
reduce the computation alignment components of U.

Fig. A.1. Algorithm ALIGNMENT-WITH-FIXUP

B. A Comment on Affine Encoding

Finally, we make a remark about affine encoding. A sanity check for
alignment equations is that there should always be a trivial solution which places
everything onto one processor. In the case of linear alignment functions and
linear array accesses, we have a solution U = 0. When we use affine functions,
this solution is still valid, but there is more. We should be able to express a
solution $U \ne 0$ that places everything on a single non-zero processor. Such
a solution would have $C_i = 0$, $D_j = 0$, $c_i = d_j = 1$. Or, using our affine
encoding:

$$\hat{C}_i = \begin{bmatrix} 0 & 0 & \dots & 0 & 1 \end{bmatrix}$$

$$\hat{D}_j = \begin{bmatrix} 0 & 0 & \dots & 0 & 1 \end{bmatrix}$$

Below, we prove that a solution of this form always exists; moreover, this gives
rise to an extra processor dimension which can be eliminated without using
the algorithm of Section A.

Let the matrix of unknowns be:

$$U = \begin{bmatrix} C_1 & \dots & C_s & D_1 & \dots & D_t \end{bmatrix}$$

Also let:

— $m_i$ be the number of columns of $C_i$ for i = 1, ..., s. ($m_i$ is the dimension
of the i-th loop.)

— $m_{s+i}$ be the number of columns of $D_i$ for i = 1, ..., t. ($m_{s+i}$ is the
dimension of the (s + i)-th array.)

— $e_k = \begin{bmatrix} 0 & 0 & \dots & 0 & 1 \end{bmatrix}^T$ (k − 1 zeros followed by a 1.)

Let w be the vector:

$$w = \begin{bmatrix} e_{m_1} \\ \vdots \\ e_{m_s} \\ e_{m_{s+1}} \\ \vdots \\ e_{m_{s+t}} \end{bmatrix}$$

It is not hard to show that $w^T V = 0$. In particular, we can show that
vector w is orthogonal to every block column $V_q$ that is assembled into V.
Suppose that $V_q$ corresponds to the equation:

$$C_i = D_j F$$

Therefore:

$$V_q = \begin{bmatrix} 0 \\ I \\ 0 \\ -F \\ 0 \end{bmatrix}$$

Note that $V_q$ has $m_i$ columns (the dimension of the i-th loop) and the last
column looks like (check the definition of F in Section 3.1):

$$\begin{bmatrix} 0 \\ e_{m_i} \\ 0 \\ -e_{m_{s+j}} \\ 0 \end{bmatrix}$$

with 1 and −1 placed in the same positions as the 1s in w. It is clear that w
is orthogonal to this column of $V_q$. w is also orthogonal to the other columns
of $V_q$, since only the last column has non-zeros where w has 1s.
How can we remove an extra dimension in U that corresponds to w? Note
that in general U will not have a row that is a multiple of w! Suppose that
U has r = rank(U) rows:

$$U = \begin{bmatrix} u_1^T \\ \vdots \\ u_r^T \end{bmatrix}$$

Since $w^T V = 0$, we have that

$$w \in \mathrm{null}(V^T)$$

But rows of U form a basis for $\mathrm{null}(V^T)$. Therefore:

$$w \in \mathrm{span}(u_1, \dots, u_r) \tag{B.1}$$

Let x be the solution to:

$$x^T U = w^T$$

Now we can just remove the row to get non-trivial solutions! Notice that
we don't really have to form J(x): we have to find x (using Gaussian
elimination) and then remove the ℓ-th row from U such that $x_\ell \ne 0$.
Summary. Due to the significant communication overhead of sending and receiving
data, loop partitioning approaches on distributed memory systems must
guarantee not just computation load balance but computation+communication load
balance. The previous approaches in loop partitioning have achieved a
communication-free, computation load balanced iteration space partitioning solution for a
limited subset of DOALL loops [6]. But a large category of DOALL loops inevitably
result in communication, and the tradeoffs between computation and
communication must be carefully analyzed for those loops in order to balance out the combined
computation time and communication overheads.

In this work, we describe a partitioning approach based on the above
motivation for the general cases of DOALL loops. Our goal is to achieve a
computation+communication load balanced partitioning through static data and iteration
space distribution. First, the code partitioning phase analyzes the references in the body
of the DOALL loop nest and determines a set of directions for reducing a larger
degree of communication by trading a lesser degree of parallelism. The partitioning is
carried out in the iteration space of the loop by cyclically following a set of direction
vectors such that the data references are maximally localized and re-used,
eliminating a large communication volume. A new larger partition owns rule is formulated to
minimize the communication overhead for a compute intensive partition by
localizing its references relatively more than those of a smaller non-compute intensive partition. A
Partition Interaction Graph is then constructed that is used to merge the partitions
to achieve granularity adjustment, computation+communication load balance and
mapping on the actual number of available processors. Relevant theory and
algorithms are developed along with a performance evaluation on Cray T3D.



1. Introduction 

Distributed memory parallel architectures are quite popular for highly
parallel scientific software development. The emergence of better routing
schemes and technologies has reduced the inter-processor communication
latency and increased the communication bandwidth by a large degree,
making these architectures attractive for a wide range of applications.

Compiling for distributed memory systems continues to pose complex,
challenging problems to researchers. Some of the important research
directions include data parallel languages such as HPF/Fortran 90D [4,12,13,16],

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 413-443, 2001. 
Springer-Verlag Berlin Heidelberg 2001 
communication free partitioning [1,6,14,22], communication minimization
[2,17,19], array privatization [30], data alignment [1,3,5,11,18,24,26,29,33],
load balancing through multi-threading [27], mapping functional parallelism
[8,9,20,21,23], compile and run time optimizations for irregular problems
[25,28] and optimizing data redistributions [15,31].

The focus of most of these approaches is on eliminating as much inter- 
processor communication as possible. The primary motivation behind such 
approaches is that the data communication speeds on most of the distributed 
memory systems are orders of magnitude slower than the processor speeds. 
In particular, the loop partitioning approaches on these systems attempt to 
fully eliminate the communication through communication free partitioning 
for a sub-set of DOALL loops [1,6,14,22]. These methods first attempt to 
find a communication free partition of the loop nest by determining a set of 
hyperplanes in the iteration and data spaces of the loop and then attempt to 
load balance the computation [6]. However, communication free partitioning
is possible only for a very small, highly restrictive sub-class of DOALL loops
and partitioning of general DOALL loops inevitably results in communication
in the absence of any data replication. In these cases, the important goals of
the loop partitioning strategy are to minimize the communication, possibly
trading parallelism, and to achieve a computation+communication load
balance for almost equal execution times of the generated loop partitions. The
literature does not present comprehensive solutions to the above issues and
this is the focus of our paper.

Section 2 describes the previous work on DOALL partitioning on dis- 
tributed memory systems and discusses our approach. Section 3 introduces 
necessary terms and definitions. Section 4 develops the theory and section 5 
discusses the algorithms for DOALL iteration and data space partitioning. 
Section 6 discusses the algorithms for granularity adjustment, load balancing 
and mapping. Section 7 illustrates the methods through an example. Section 
8 deals with the performance results on Cray T3D and conclusions. 



2. DOALL Partitioning 

DOALL loops offer the highest amount of parallelism to be exploited
in many important applications. The primary motivation in DOALL
partitioning on distributed memory systems is reducing data communication
overhead. The previous work on this topic has focused on completely
eliminating communication to achieve a communication free iteration and data
space partition [1,6,14,22]. But in many practical DOALL loops, the
communication free partitioning may not be possible due to the incompatible
reference instances of a given variable encountered in the loop body or due
to incompatible variables [14,22]. The parallelization of such DOALL loops
is not possible by the above approaches. In this work, our motivation is to
develop an iteration and data space partitioning method for these DOALL
loops where the reference patterns do not permit a communication free
partition without replicating the data. We attempt to minimally trade parallelism
to maximally eliminate communication. Our other objective is to achieve a
computation+communication load balanced partitioning of the loop. The
focus, thus, is on communication minimization and computation+communication
load balanced partitioning, as against communication elimination and
computation load balance for restricted cases as in previous approaches [6]. We
choose not to replicate the data since it involves a point-to-point or broadcast
type communication and poses an initial data distribution overhead on every loop slice.

We first motivate our approach through an example. 



2.1 Motivating Example 

Consider the following DOALL loop: 

for i = 2 to N
  for j = 2 to N
    A[i,j] = B[i-2,j-1] + B[i-1,j-1] + B[i-1,j-2]
  endfor
endfor

As far as this loop is concerned, it is not possible to determine a com- 
munication free data and iteration space partition [1,6,14,22]. The reason 
this loop can not be partitioned in a communication free manner is that we 
can not determine a direction which will partition the iteration space so that 
all the resulting data references can be localized by suitably partitioning the 
data space of B without any replication. 

The question is: can we profitably (speedup > 1) parallelize this loop
in any way? If there are many ways for such a parallelization, which one
will give us computation+communication load balance for the best speedup?
We illustrate that by carefully choosing the iteration and data distributions,
it is possible to maximally eliminate the communication while minimally
sacrificing the parallelism to minimize the loop completion time and maximize
the speedup. It is then possible to construct a partition interaction graph to
adapt the partitions for granularity and computation+communication load
balance and to map them on the available number of processors for a specific
architecture.

One approach is to replicate each element of matrix B on each processor 
and partition the nested DOALL iterations on a N x N mesh so that each 
processor gets one iteration. This, however, has a data distribution overhead 
N“^ and thus, is not a good solution where cost of communication is higher 
than computation. This method has the maximum parallelism, but it will not 
give any speedup due to a very high data distribution overhead. The other 
possibility is to minimize communication by carefully choosing iteration and 
data distributions. 
Table 2.1. Comparison of effect of partitioning on parallelism, communication
and loop execution time

Dir.     #Part.   Commn.        Execution Time
(-1,0)   (N-1)    4(N-2)(N-1)   (N-1) + c1(2N-3) + c2(2N-3)
(0,1)    (N-1)    4(N-2)(N-1)   (N-1) + c1(2N-3) + c2(2N-3)
(-1,1)   (2N-3)   4(N-2)(N-1)   (N-1) + 2c1(N-1) + 2c2(N-1)
Cyclic   (N-1)    2(N-2)(N-1)   (2N-5) + 2c2(2N-5)/2 + c1(2N-5)/2

In the above example, it is possible to determine communication free
directions for the iteration and data spaces by excluding some references. In
this example, let $B_1 \equiv B[i-2, j-1]$, $B_2 \equiv B[i-1, j-1]$, $B_3 \equiv B[i-1, j-2]$.

If we decide to exclude $B_1$, the communication free direction for partitioning
the iteration space is (0,1) (data space partitioned column wise). If we decide
to exclude $B_2$, the iteration partitioning direction is (-1,1) (data partitioning
along anti-diagonal). If we decide to exclude $B_3$, one could partition iterations
along the direction (-1,0) (data partition will be row wise). Please refer to
figure 2.1 for details of iteration and data partitioning for (0,1) partitioning
and figure 2.2 for (-1,0) partitioning. The iterations/data grouped together
are connected by arrows which show the direction of partitioning. Figure 2.3
shows details of iteration and data partitioning for (-1,1) direction vector.
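The three directions can be recovered from the reference offsets alone. The sketch below rests on one observation: iteration I reading B[I + a] and iteration I' reading B[I' + b] touch the same element exactly when I' − I = a − b, so grouping iterations along a − b keeps both references local. The dictionary of offsets is read off the loop body; the names `OFFSETS` and `direction_when_excluding` are illustrative.

```python
# Offsets of the three reference instances of B in the motivating loop.
OFFSETS = {
    "B1": (-2, -1),   # B[i-2, j-1]
    "B2": (-1, -1),   # B[i-1, j-1]
    "B3": (-1, -2),   # B[i-1, j-2]
}


def direction_when_excluding(excluded: str) -> tuple:
    """Difference of the offsets of the two references that are kept."""
    kept = [name for name in sorted(OFFSETS) if name != excluded]
    a, b = OFFSETS[kept[0]], OFFSETS[kept[1]]
    return tuple(x - y for x, y in zip(a, b))


if __name__ == "__main__":
    for name in sorted(OFFSETS):
        print(name, direction_when_excluding(name))
```

The sign of each vector is a matter of convention; the subtraction order used here happens to reproduce the signs quoted in the text.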



Fig. 2.1. Iteration and data partitioning for direction vector (0,1). (a) Data Partition of 'B' (b) Iteration Partition

Fig. 2.2. Iteration and data partitioning for direction vector (-1,0). (a) Data Partition of 'B' (b) Iteration Partition

Fig. 2.3. Iteration and data partitioning for direction vector (-1,1). (a) Data Partition of 'B' (b) Iteration Partition



Table 2.1 shows the volume of communication, the parallelism and the loop
completion times for each of the above three cases. It is assumed that the loop
body takes unit time for execution, whereas each 'send' operation is $c_1$ times
as expensive as the loop body and each 'receive' operation is $c_2$ times as expensive
as the loop body. We now show that if we carry out the iteration and data
space partitioning in the following manner, we can do better than any of the
above partitions.

We first decide to partition along the direction (0,1) followed by
partitioning along (-1,0) on a cyclical basis. If we partition this way, it automatically
ensures that the communication along the direction (-1,1) is not necessary. In
other words, the iteration space is partitioned such that most of the iterations
in the space reference the same data elements. This results in localization of
most of the references to the same iteration space partition. Please refer to
figure 2.4 for details about iteration and data space partitions using cyclic
directions.





Fig. 2.4. Iteration and data partitioning for cyclical direction vector (0,1)/(-1,0). (a) Data Partition of 'B' (b) Iteration Partition



In this partitioning scheme, whatever parallelism is lost due to sequential- 
ization of additional iterations is more than offset by elimination of additional 
amount of communication, thus improving overall performance. We, thus, 
compromise the parallelism to a certain extent to reduce the communication. 
However, the resulting partitions from this scheme are not well balanced 
with respect to total computation+communication per processor. If we per- 
form a careful data distribution and iteration partitions merging after this 
phase, this problem could be solved. Our objective in data distribution is to 
minimize communication overhead on larger partitions and put the burden
of larger overheads on smaller partitions so that each partition is more load
balanced with respect to the total computation+communication.

For this example, we demonstrate the superiority of our scheme over the other
schemes listed in Table 2.1 as follows.

It is clear that our scheme has half the communication volume of any of the
(0,1), (-1,0) or (-1,1) partitionings (we determine the
volume of communication by counting total non-local references; each
non-local reference is counted as one send at its sender and one receive at its
receiver). The total number of partitions is (N-1) in the case of (0,1) or (-1,0),
(2N-3) in the case of (-1,1) and (N-1) in the case of (0,1)/(-1,0) cyclic
partitioning. Thus, (0,1)/(-1,0) has less parallelism than (-1,1)
partitioning, but it also eliminates communication by the same degree, which
is more beneficial since communication is more expensive than computation.
As an overall effect of the saving in communication and the more balanced
computation+communication at every partition, the loop execution time and thus the
speedup resulting from our scheme is much superior to any of the above. The
loop execution time given by (0,1) or (-1,0) partitioning is $(N-1) + c_1(2N-3)
+ c_2(2N-3)$, that of (-1,1) is $(N-1) + 2c_1(N-2) + 2c_2(N-2)$, whereas according
to our scheme, the loop execution time is $(2N-5) + 2c_2(2N-5)/2
+ c_1(2N-5)/2$. It can be easily seen by comparing the expressions that
the loop execution time given by our scheme is superior to any of the other
ones if $c_1, c_2 > 1$ (typically $c_1, c_2 \gg 1$). Finally, it may be noted that our
scheme achieves an asymptotic speedup of $N/(2 + 2c_2 + c_1)$. In modern
systems with low communication/computation ratio, the typical values of $c_1$
and $c_2$ are a few hundred. For example, assuming $c_1$ and $c_2$ are about 200, for N
> 600 our scheme would give speedups > 1. This scheme, thus, results in
effective parallelization of even medium size problems. Also, one can overlap
the computation and communication on an architecture in which each PE
node has a separate communication processor. In this case, the loop
completion time can approach the ideal parallel time since most communication
overheads are absorbed due to overlap with the computation.
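The break-even point quoted above can be checked by evaluating the expressions directly. This is a sketch under two assumptions: the sequential time is taken as (N-1)^2 unit-time iterations (both loops run from 2 to N), and the cyclic expression is used without any rounding of (2N-5)/2. The function names are illustrative.

```python
def sequential_time(n: int) -> int:
    """(N-1)^2 iterations, each of unit cost."""
    return (n - 1) ** 2


def time_row_or_column(n: int, c1: float, c2: float) -> float:
    """Loop execution time for (0,1) or (-1,0) partitioning."""
    return (n - 1) + c1 * (2 * n - 3) + c2 * (2 * n - 3)


def time_cyclic(n: int, c1: float, c2: float) -> float:
    """Loop execution time for the cyclic (0,1)/(-1,0) scheme."""
    return (2 * n - 5) + 2 * c2 * (2 * n - 5) / 2 + c1 * (2 * n - 5) / 2


def speedup_cyclic(n: int, c1: float, c2: float) -> float:
    return sequential_time(n) / time_cyclic(n, c1, c2)


if __name__ == "__main__":
    for n in (400, 600, 700, 2000):
        print(n, round(speedup_cyclic(n, 200, 200), 3))
```

With c1 = c2 = 200 the speedup crosses 1 just above N = 600, in line with the asymptotic form N/(2 + 2c2 + c1) = N/602.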

The above scheme, however, results in a large number of partitions that 
could lead to two problems in mapping them. The first problem is that the 
partitions could be too fine grained for a given architecture and the second 
problem could be that the number of available processors may be much smaller
than the number of partitions (as is usually the case). In order to solve these
problems, we perform architecture dependent analysis after iteration and 
data space partitioning. We construct a Partition Interaction Graph from 
the iteration space partitions and optimize by merging partitions with respect 
to granularity so that communication overheads are reduced at the cost of 
coarser granularity. We then load balance the partitions with respect to total 
execution time consisting of computation+communication times and finally 
map the partitions on available number of processors. We now present an 
overall outline of our approach. 



2.2 Our Approach 

Figure 2.5 shows the structure of our DOALL partitioner and scheduler.

Fig. 2.5. DOALL Partitioner and Scheduler (final output: partitions mapped to P processors)

The partitioner consists of five phases:

— Code Partitioning Phase: This phase is responsible for analyzing the
references in the body of the DOALL loop nest and determining a set of directions
to partition the iteration space to minimize the communication by
minimally trading the parallelism.

— Data Distribution Phase: This phase visits the iteration partitions
generated above in the order of decreasing sizes and uses a larger partition owns
rule to generate the underlying data distribution so that larger compute
intensive partitions incur lesser communication overhead and vice-versa.
The larger partition owns rule says that if the same data item is referenced
by two or more partitions, the largest partition owns the data item. The
goal is to generate computation+communication load balanced partitions.

— Granularity Adjustment Phase: This phase analyzes whether the granu- 
larity of the partitions generated above is optimal or not. It attempts to 
combine two partitions which have a data communication and determines 
if the resulting partition is better in terms of completion time. It continues 
this process until the resulting partition has a worse completion time than 
any of the partitions from which it is formed. In this manner, a signifi- 
cant amount of communication is eliminated by this phase to improve the 
completion time. 

— Load Balancing Phase: This phase attempts to combine the load of several 
lightly loaded processors to reduce the number of required processors. Such 
merging is carried out only to the extent that the overall completion time 
does not degrade. 

— Mapping Phase: This phase is responsible for mapping the partitions from 
the previous phase to a given number of processors by minimally degrading 
the overall completion time. The partitions that minimally degrade the 
completion time on merger are combined and the process is continued till 
the number of partitions equal the number of available processors. 
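The larger partition owns rule of the Data Distribution Phase can be sketched as below. The representation of a partition as a (size, referenced items) pair, the tie-breaking by input order, and the toy data are all assumptions made for illustration; the chapter's own algorithms are developed in later sections.

```python
def assign_owners(partitions: dict) -> dict:
    """Visit partitions in decreasing size; the first (largest) claimant owns an item.

    partitions maps a partition name to (size, set of referenced data items).
    """
    owners = {}
    by_size = sorted(partitions.items(), key=lambda kv: -kv[1][0])  # stable sort
    for name, (_size, items) in by_size:
        for item in items:
            owners.setdefault(item, name)
    return owners


def nonlocal_references(partitions: dict, owners: dict) -> dict:
    """Number of referenced items each partition does not own."""
    return {
        name: sum(1 for item in items if owners[item] != name)
        for name, (_size, items) in partitions.items()
    }


if __name__ == "__main__":
    # Hypothetical partitions: sizes are iteration counts, items are array elements.
    parts = {
        "P1": (5, {"a", "b", "c"}),
        "P2": (3, {"b", "c", "d"}),
        "P3": (1, {"d", "e"}),
    }
    owners = assign_owners(parts)
    print(owners)
    print(nonlocal_references(parts, owners))
```

In this toy instance the largest partition ends up with no non-local references, which is the intended effect: communication overhead is shifted toward the partitions with less computation.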

The first two phases are architecture independent and the last three
phases are architecture dependent; the latter use the architecture cost model
to perform granularity adjustment, load balancing and mapping. We first
develop the theory behind the code and data partitioning phases shown in
figure 2.5.



3. Terms and Definitions 

We limit ourselves to perfectly nested, normalized DOALL loops. We define
each occurrence of a given variable in the loop nest as a reference
instance of that variable. For example, the different occurrences of a given
variable ‘B’ are the reference instances of ‘B’, denoted $B_1$, $B_2$,
$B_3$, ..., for convenience. The iteration space of an n-nested loop is defined as
$I = \{(i_1, i_2, i_3, \ldots, i_n) \mid L_j \le i_j \le U_j,\ 1 \le j \le n\}$,
where $i_1, i_2, \ldots, i_n$ are the different index variables of the loop nest.
In the loop body, an instance of a variable ‘B’ references a subset of the
data space of variable ‘B’. For example, the instance
$B_1[i_1 + \sigma_1^1, i_2 + \sigma_2^1, i_3 + \sigma_3^1, \ldots]$ references the data space
of matrix B decided by the iteration space as defined above and the
offsets $\sigma_1^1, \sigma_2^1, \ldots$.
Each partition of the iteration space is called an iteration block. In order to
generate a communication free data and iteration partition, we determine
partitioning directions in the iteration and data spaces such that the references
generated in each iteration block can be disjointly partitioned and allocated
in the local memory of a processor to avoid communication. Although most of
the discussion in this paper uses constant offsets for the variable references in
each dimension, in general the references can be uniformly generated [6] so
that it is possible to perform communication free partitioning analysis. Please
note that our approach uses communication free partitioning analysis as the
underlying method (as described in later sections); thus, the underlying
assumptions and restrictions are the same as those of the methods described
in the literature [1, 6, 14, 22]. All of these methods are able to handle uniformly
generated references; thus, our method is able to do the same.

A set of reference instances of a variable is called the instance set of that
variable. A set of reference instances of a variable for which a communication
free data and iteration partition can be determined is defined as a set of
compatible instances of the variable. If a communication free partition cannot
be found, such a set of reference instances is called a set of incompatible
instances. If a communication free partition can be determined for a set
of variables considering all their instances, it is called a set of compatible
variables; otherwise it is called a set of incompatible variables. In this paper,
we focus on minimizing the communication when we have a set of incompatible
instances of a variable so that a communication free partition cannot be
found. Minimizing communication for multiple incompatible variables is even
harder and is not attempted here.



3.1 Example 

Consider the following code: 

for i := 1 to N
  for j := 1 to N
    a[i,j] := (b[i,j] + b[i-1,j-1] + b[i-1,j]
               + b[i-1,j+1] + b[i,j-1]
               + b[i,j+1] + b[i+1,j-1]
               + b[i+1,j] + b[i+1,j+1])/9
  endfor
endfor

For this code, it is not possible to determine a communication free iteration
and data partitioning direction. Let $b_1 \equiv b[i,j]$, $b_2 \equiv b[i-1,j-1]$,
$b_3 \equiv b[i-1,j]$, $b_4 \equiv b[i-1,j+1]$, $b_5 \equiv b[i,j-1]$, $b_6 \equiv b[i,j+1]$,
$b_7 \equiv b[i+1,j-1]$, $b_8 \equiv b[i+1,j]$, $b_9 \equiv b[i+1,j+1]$. Thus, the instance
set for the variable b is given by $\{b_1, b_2, \ldots, b_9\}$ for the nine occurrences
of b. All these reference instances are, therefore, incompatible.



4. Problem

We begin by stating the problem of communication minimization for
incompatible instances of a variable as follows:

Given an instance set of a variable B, denoted by $S_B = \{B_1, B_2, B_3, \ldots, B_m\}$,
which may comprise incompatible instances occurring within a loop nest
as described before, determine a set of communication minimizing directions
so that the volume of communication reduced is at least equal to
the parallelism reduced. We measure the volume of communication by
the number of non-local references (the references which fall outside
the underlying data partition) corresponding to an iteration block. In our
formulation of the problem, no data replication is allowed. There is only one
copy of each array element, kept at one of the processors, and whenever
any other processor references it, there is a communication: one send at the
owner processor and one receive at the processor which needs it. The justification
for reducing the above volume of communication is that the data communication
latency in most distributed memory systems consists of a fixed start-up
overhead to initiate communication and a variable part proportional to the
length of the message (that is, to the number of data items). Thus, reducing the
number of non-local data values reduces this second part of the communication
latency. Of course, one may perform message vectorization following our
partitioning phase to group the values together to be sent in a single message to
amortize the start-up costs. Such techniques are presented elsewhere [19] and
do not form a part of this paper.

We measure the amount of parallelism reduced by the number of ad- 
ditional iterations being introduced in an iteration block to eliminate the 
communication. 

4.1 Compatibility Subsets 

We begin by outlining a solution which may attain the above objective. We
first partition the instance set of a variable, $S_B$, into p subsets
$S_B^1, S_B^2, \ldots, S_B^p$ which satisfy the relation:

— All the reference instances of the variable belonging to a given subset are
compatible so that one can determine a direction for communication free
partitioning. Formally, $\forall B_i \in S_B^j$, $\exists (d_1^j, d_2^j, \ldots, d_r^j)$ such that
partitioning along direction vector $(d_1^j, d_2^j, \ldots, d_r^j)$ achieves communication
free partition, where $1 \le j \le p$.

— At least one reference instance belonging to a given subset is incompatible
with all the reference instances belonging to any other subset. Formally,
$\exists B_i \in S_B^j$ so that it is incompatible with all $B_l \in S_B^k$, where $j \ne k$,
$1 \le j, k \le p$. In other words, one cannot find a communication free
partition for $S_B^k \cup \{B_i\}$ for some $B_i \in S_B^j$.

It is easy to see that the above relation is a compatibility relation. It is well
known that a compatibility relation only defines a covering of the set and
does not define mutually disjoint partitions. We, therefore, first determine p
maximal compatibility subsets $S_B^1, S_B^2, \ldots, S_B^p$ from the above relation. For
each of the maximal compatibility subsets, there exists a direction for
communication free partitioning. The algorithm to compute maximal compatibility
subsets is described in the next section. The following Lemma summarizes the
maximum and minimum number of maximal compatibility subsets that can
result from the above relation.

Lemma 1 : If $m \equiv |S_B|$ and if p maximal compatibility subsets result from
the above relation on $S_B$, then $2 \le p \le C_2^m$.

Proof : 

It is clear that there must exist at least one $B_i \in S_B$ such that it is not
compatible with $S_B - \{B_i\}$. If this is not the case, then a communication free
partition should exist for all the instances belonging to $S_B$, which is not true.
Thus, a minimum of two compatibility subsets must exist for $S_B$. This proves the
lower bound.

We now show that any two reference instances $B_i, B_j \in S_B$ are always
compatible. Let $(\sigma_1^i, \sigma_2^i, \ldots, \sigma_r^i)$ and $(\sigma_1^j, \sigma_2^j, \ldots, \sigma_r^j)$ be the two
offsets corresponding to instances $B_i$ and $B_j$ respectively. Thus, if we partition
along the direction $(\sigma_1^i - \sigma_1^j, \sigma_2^i - \sigma_2^j, \ldots, \sigma_r^i - \sigma_r^j)$ in the iteration
and data space of B, we will achieve communication free partitioning as far as
the instances $B_i$ and $B_j$ are concerned. Thus, for any two instances, communication
free partitioning is always possible, proving that they are compatible. The number
of subsets of $S_B$ which have two elements is given by $C_2^m$, proving the upper
bound on p.

q.e.d 
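
As a small illustration of the construction used in the proof, the worked check below takes two reference instances from the example in Section 3.1, $b_2 = b[i-1,j-1]$ and $b_6 = b[i,j+1]$, and evaluates the bounds of Lemma 1 for that example's instance set of nine references. The instance names and offsets come from the example; the arithmetic is only a sketch and is not part of the original proof.

```latex
\begin{align*}
\sigma^{2} &= (-1,-1), \qquad \sigma^{6} = (0,1) \\
\sigma^{6} - \sigma^{2} &= \bigl(0-(-1),\; 1-(-1)\bigr) = (1,2) \\
m &= |S_b| = 9, \qquad C_2^{9} = \frac{9 \cdot 8}{2} = 36, \qquad 2 \le p \le 36
\end{align*}
```

The pair $\{b_2, b_6\}$ is therefore compatible along (1,2), which agrees with the subset listed with that direction in the example of Section 7.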



The bounds derived in the above lemma allow us to prove the overall 
complexity of our Communication Minimizing Algorithms discussed later. 

The next step is to determine a set of cyclically alternating directions from 
the compatibility subsets found above to maximally cover the communication. 



4.2 Cyclic Directions 

Let the instance set $S_B$ for a variable B be partitioned into $S_B^1, S_B^2, \ldots, S_B^p$,
which are maximal compatibility subsets under the relation of communication
free partitioning. Let Comp(B) be the set of communication free partitioning
directions corresponding to these compatibility subsets. Thus, Comp(B) =
$\{D^1, D^2, \ldots, D^p\}$, where $D^j = (d_1^j, \ldots, d_r^j)$ is the direction of
communication free partitioning for the subset $S_B^j$. The problem now is to determine a
subset of Comp(B) which maximally covers^1 the directions in Comp(B) as
explained below. Let such a subset of Comp(B) be denoted by Cyclic(B). Let
Cyclic(B) = $\{D^{\pi^{-1}(1)}, D^{\pi^{-1}(2)}, \ldots, D^{\pi^{-1}(t)}\}$, where $D^{\pi^{-1}(i)} = D^j$, or $i \equiv \pi(j)$,
defines a permutation which maps the jth element of Comp(B) to the ith position
in Cyclic(B). We now state the property which allows us to determine such a
maximal, ordered subset Cyclic(B) of Comp(B):

Property 1 : The subset Cyclic(B) must satisfy all of the following:

1. $D^{\pi^{-1}(i)} = D^{\pi^{-1}(i-1)} + D^{\pi^{-1}(i-2)}$, where $3 \le i \le t$.
   The direction $D^{\pi^{-1}(i)}$ is then said to be covered by the directions
   $D^{\pi^{-1}(i-1)}$ and $D^{\pi^{-1}(i-2)}$. Thus, each of the elements of the ordered
   set Cyclic(B) must be covered by the previous two elements, the exception
   being the first two elements of Cyclic(B).

2. Consider Comp(B) - Cyclic(B), and let some $D^k$ belong to this set. If
   $D^k = c_1 \cdot D^{\pi^{-1}(t)} + \sum_{i=1}^{j} D^{\pi^{-1}(i)}$, where $1 \le j \le (t-1)$, $c_1 \in I^+$
   (in other words, if the direction $D^k$ can be expressed as a linear combination
   of a multiple of $D^{\pi^{-1}(t)}$ and a summation of a subset of ordered directions
   as above) then it is covered and there is no communication along it.
   Let Uncov(B) be the subset of Comp(B) - Cyclic(B) such that $\forall D^k \in$ Uncov(B),
   $D^k \ne c_1 \cdot D^{\pi^{-1}(t)} + \sum_{i=1}^{j} D^{\pi^{-1}(i)}$; that is, none of its elements
   is covered. Let $s \equiv |Uncov(B)|$.

3. Cyclic(B) is that subset of Comp(B) which, satisfying the properties
   stated in 1 and 2 above, leads to the minimum s.

Stated more simply, Cyclic(B) is an ordered subset of Comp(B) which
leaves the minimum number of uncovered directions in Comp(B). If we
determine Cyclic(B) and follow the corresponding communication free directions
cyclically from $D^{\pi^{-1}(1)}$ to $D^{\pi^{-1}(t-1)}$ (such as $D^{\pi^{-1}(1)}, D^{\pi^{-1}(2)}, \ldots,
D^{\pi^{-1}(t-1)}, D^{\pi^{-1}(1)}, D^{\pi^{-1}(2)}, \ldots$), communication is reduced by a larger
degree than the loss of parallelism, which is beneficial. The following Lemma
formally states the result:

Lemma 2 : If we follow iteration partitioning cyclically along the directions
corresponding to Cyclic(B) as above, then for each basic iteration block (a basic
iteration block is obtained by starting at a point in the iteration space and
traversing once along the directions corresponding to Cyclic(B) from there),
parallelism is reduced by (t-1) (due to sequentialization of (t-1) iterations)
and the communication is reduced by (p+t)-(s+3), where $p \equiv |Comp(B)|$,
$t \equiv |Cyclic(B)|$ and $s \equiv |Uncov(B)|$.

Proof :

It is easy to see that if $t \equiv |Cyclic(B)|$, we traverse once along the
corresponding directions and thus introduce (t-1) extra iterations in a basic
iteration block, reducing the parallelism appropriately. So, we prove the
result for communication reduction.

^1 A given direction is said to be covered by a set of directions iff partitioning
along the directions in the set eliminates the need for communication along the
given direction.

It is obvious that if we traverse the iteration space along (t-1) directions
corresponding to the ordered set Cyclic(B), the communication is reduced
by (t-1). In addition, since Property 1, condition 1 is satisfied
by these directions, an additional (t-2) directions are covered, eliminating the
corresponding communication. The partitioning is also capable of covering
all the directions $c_1 \cdot D^{\pi^{-1}(t)} + \sum_{i=1}^{j} D^{\pi^{-1}(i)}$, where
$1 \le j \le (t-1)$, $c_1 \in I^+$ (Property 1, condition 2). These directions are the ones
which correspond to Comp(B) - Cyclic(B) - Uncov(B). Thus, the number
of such directions is (p - t - s), and the total number of directions covered =
(t-1)+(t-2)+(p-t-s) = (p+t)-(s+3). Thus, in one basic iteration partition,
one is able to eliminate communication equal to ((p+t)-(s+3)) by
reducing parallelism by an amount (t-1).

q.e.d 
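
To make the counting in Lemma 2 concrete, the following worked check plugs in the values that arise in the texture smoothing example of Section 7, where Comp(b) has eight directions, Cyclic(b) has three, and two further directions are covered through condition 2 of Property 1. It is only an arithmetic sketch using numbers stated in that example.

```latex
\begin{align*}
p &= 8, \qquad t = 3, \qquad s = p - t - 2 = 3 \\
(t-1) + (t-2) + (p-t-s) &= 2 + 1 + 2 = 5 \\
(p+t) - (s+3) &= 11 - 6 = 5 \\
\text{parallelism reduced} &= t - 1 = 2
\end{align*}
```

Both ways of counting give a communication reduction of 5 for a parallelism reduction of 2, as reported for the first Fibonacci set in Section 7.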

Corollary 1 : According to the above lemma, we must find at least one 
pair of directions which covers at least one other direction in Comp(B) to 
reduce more communication than parallelism. 

Proof : 

The above Lemma clearly demonstrates that, in order to reduce more
communication than parallelism, we must have (p+t)-(s+3) > (t-1), or
(p-s) > 2. Now, Comp(B) = Cyclic(B) + Cov(B) + Uncov(B), where Cov(B)
is the set of directions covered as per condition 2 in Property 1. In other
words, p = t + q + s, where $|Cov(B)| \equiv q$. Thus, $(p-s) \ge 3$, or equivalently
$(t+q) \ge 3$. At its lowest value, (t+q) = 3. Consider the following cases for (t+q) = 3:

1. t = 0, q = 3 : This is impossible since if Cyclic(B) is empty, it can not 
cover any directions in Comp(B). 

2. t = 1, q = 2: This is also impossible since one direction in Cyclic(B) can 
not cover two in Comp(B). 

3. t = 2, q = 1: This is possible since two directions in Cyclic(B) can cover 
a direction in Comp(B) through Property 1, condition 2. 

4. t = 3, q = 0: This is also possible, since Cyclic(B) would then have three 
elements related by Property 1, condition 1. 

It can be seen that only cases (3) and (4) above are possible and each one 
would imply that a direction in Comp(B) is covered by Cyclic(B) either 
through condition 1 or through condition 2 of Property 1. Thus, the result. 

q.e.d 

Thus, in order to maximally reduce communication, we must find Cy- 
clic(B) from Comp(B) so that it satisfies Property 1. As one can see, the 
directions in Cyclic(B) form a Fibonacci Sequence as per Property 1 maxi- 
mally covering the remaining directions in Comp(B). Our problem is, thus, 
to find a maximal Fibonacci Sequence using a minimal subset of Comp(B). 
The algorithm to determine such a subset is discussed in the next section. 


5. Communication Minimization

In this section, we discuss the two algorithms based on the theory developed
in the last section. The first algorithm determines the maximal compatibility
subsets of the instance set of a given variable and the second one determines
a maximal Fibonacci Sequence as discussed in the last section. We also
analyze the complexity of these algorithms. For an illustration of the working of
these algorithms, please refer to the example presented in section 7.

5.1 Algorithm : Maximal Compatibility Subsets 

This algorithm finds the maximal compatibility subsets, Comp(B), of a variable
B, given the instance set $S_B$ as an input.

As one can see, the compatibility relation of communication free
partitioning for a set of references (defined before) is reflexive and symmetric but
not necessarily transitive. If a and b are compatible, we denote this relation
as $a \sim b$.

1. Initialize Comp(B) := $\emptyset$, k := 1.

2. for every reference instance $B_i \in S_B$ do

   a) Find $B_j \in S_B$ such that $B_i \sim B_j$ but $\{B_i, B_j\} \not\subseteq S_B^p$ for
      $1 \le p < k$. In other words, find a pair of references such that it has not
      been put into some compatibility subset already constructed so far
      (where k-1 is the number of compatibility subsets constructed so far).
      Whether or not $B_i \sim B_j$ can be determined by algorithms described
      in [1, 6, 14, 22].

   b) Initialize $S_B^k := \{B_i, B_j\}$ (put the pair satisfying above property into
      a new subset $S_B^k$ being constructed).

   c) For every $B_l \in (S_B - S_B^k)$, do

      - if $B_l \sim S_B^k$, $S_B^k := S_B^k \cup \{B_l\}$

      - Add the constructed subset to Comp(B): Comp(B) :=
        Comp(B) $\cup \{S_B^k\}$, k := k+1.

   d) Repeat steps (a) through (c) above till no $B_j$ can be found satisfying
      the condition in (a).

3. After all the subsets are constructed, replace each of them by the
   corresponding communication free partitioning direction. That is, for
   Comp(B) constructed above, replace each $S_B^j$ by $D^j$, where $D^j$ is the
   corresponding communication free direction for $S_B^j$.

As one can see, the above algorithm checks for the compatibility relation
from an element of $S_B$ to all the other elements of $S_B$ and therefore its
worst case complexity is $O(|S_B|^2)$.
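
The steps above can be sketched in a few lines of Python for the constant-offset case. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes every reference instance has a distinct constant offset vector, and it treats a set of instances as compatible exactly when all their offsets lie on one line, so that the line's direction is the communication free direction. The names `direction`, `maximal_compatibility_subsets` and `stencil` are hypothetical.

```python
from itertools import combinations
from math import gcd


def direction(a, b):
    """Normalized direction of the line through two distinct offsets a and b."""
    diff = [y - x for x, y in zip(a, b)]
    g = 0
    for c in diff:
        g = gcd(g, abs(c))
    diff = [c // g for c in diff]
    # Flip the sign so the first non-zero component is positive:
    # (1, 1) and (-1, -1) describe the same partitioning direction.
    first = next(c for c in diff if c != 0)
    if first < 0:
        diff = [-c for c in diff]
    return tuple(diff)


def maximal_compatibility_subsets(offsets):
    """offsets maps an instance name to its constant offset vector (assumed distinct)."""
    names = list(offsets)
    subsets = []  # list of (set of instance names, direction)
    for bi, bj in combinations(names, 2):
        # Step (a): skip pairs already placed together in some subset.
        if any(bi in members and bj in members for members, _ in subsets):
            continue
        d = direction(offsets[bi], offsets[bj])
        # Steps (b) and (c): grow the subset with every instance on the same line.
        members = {bi, bj}
        for bl in names:
            if bl not in members and direction(offsets[bi], offsets[bl]) == d:
                members.add(bl)
        subsets.append((members, d))
    return subsets


# Offsets of the nine references b1..b9 from the example in Section 3.1.
stencil = {
    "b1": (0, 0), "b2": (-1, -1), "b3": (-1, 0), "b4": (-1, 1), "b5": (0, -1),
    "b6": (0, 1), "b7": (1, -1), "b8": (1, 0), "b9": (1, 1),
}
subsets = maximal_compatibility_subsets(stencil)
comp = sorted({d for _, d in subsets})  # step 3: keep only the directions
print(len(subsets), comp)
```

For the nine-point stencil this sketch groups the references into the same subsets and directions that the worked example in Section 7 lists; for references that are not constant offsets, the collinearity test would have to be replaced by one of the analyses cited in step (a).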

5.2 Algorithm : Maximal Fibonacci Sequence

The following algorithm determines the set Cyclic(B) using Comp(B) as an input.

1. Sort the set Comp(B). If $\{D^1, D^2, \ldots, D^p\}$ is the sorted set, it must satisfy
   the following order:

   - $D_1^i < D_1^{i+1}$, or

   - if $D_j^i = D_j^{i+1}$ for all j such that $1 \le j \le k$ and for some
     k such that $1 \le k \le r-1$, then $D_{k+1}^i < D_{k+1}^{i+1}$.

   The elements $D^1, D^2, \ldots, D^p$ are then said to be sorted in non-decreasing
   order < such that $D^1 < D^2 < \ldots < D^p$.

2. Initialize set MaxFib := $\emptyset$, max := 0.

3. for i := 1 to n
     for j := i+1 to n

   a) Let D := $D^i + D^j$. Initialize last := j, Fib := $\emptyset$, k := j+1.

   b) while ($D^k$ < D)
        k := k+1

   c) if ($D^k$ = D), Fib := Fib $\cup \{D^k\}$, D := $D^k + D^{last}$, last := k, k := k+1.

   d) Repeat steps (b) and (c) above till k > n.

   e) Let q be the number of additional directions covered in Comp(B)
      by Fib as per Property 1. In other words, let D $\in$ Comp(B) - Fib.
      If D equals $c_1$ times the last direction of Fib plus the sum of the
      first u ordered directions of Fib (condition 2 of Property 1), where
      $1 \le u \le |Fib|$, $c_1 \in I^+$, D is already covered by Cyclic(B).
      Determine q, the number of such covered directions in Comp(B) - Fib.

   f) if max < |Fib| + q, MaxFib := Fib, max := |Fib| + q.

4. Cyclic(B) := MaxFib.

As one can see, the sorting step for the above algorithm would require
$O(p \log p)$ and the step of finding the maximal cover would require $O(p^3)$.
Thus, the total complexity of the algorithm is $O(p \log p + p^3)$. From
Lemma 1, since $p \le C_2^{|S_B|}$, the overall complexity of the algorithm is
$O(|S_B|^2 \log |S_B| + |S_B|^6)$.

The code partitioning phase (refer to figure 2.5) uses these two algo- 
rithms to determine a set of communication minimizing directions (given by 
Cyclic(B)) for iteration space partitioning. 
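
The search for a maximal Fibonacci sequence can be sketched as follows. This is an illustrative sketch, not the authors' code, and it makes three assumptions that should be read as such: Fib is seeded with the starting pair (as the worked example in Section 7 writes it), set membership replaces the linear scan of steps (b) to (d), and step (e) is read as "a positive integer multiple of the last direction of Fib plus the sum of a prefix of the earlier directions". All function names are hypothetical.

```python
def add(u, v):
    return tuple(a + b for a, b in zip(u, v))


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def is_positive_multiple(v, base):
    """True if v == c * base for some integer c >= 1."""
    c = None
    for x, b in zip(v, base):
        if b == 0:
            if x != 0:
                return False
            continue
        if x % b != 0:
            return False
        q = x // b
        if q < 1 or (c is not None and q != c):
            return False
        c = q
    return c is not None


def fibonacci_chain(comp, first, second):
    """Extend [first, second] while the sum of the last two stays in comp."""
    chain = [first, second]
    while True:
        nxt = add(chain[-2], chain[-1])
        if nxt not in comp or nxt in chain:
            return chain
        chain.append(nxt)


def covered(comp, chain):
    """Directions outside the chain of the form c*last + (prefix sum), c >= 1."""
    last = chain[-1]
    prefix_sums, running = [], tuple(0 for _ in last)
    for d in chain[:-1]:
        running = add(running, d)
        prefix_sums.append(running)
    return [d for d in comp if d not in chain
            and any(is_positive_multiple(sub(d, p), last) for p in prefix_sums)]


def maximal_fibonacci(comp):
    comp = sorted(comp)          # lexicographic order, as in step 1
    members = set(comp)
    best, best_score = [], 0
    for i in range(len(comp)):
        for j in range(i + 1, len(comp)):
            chain = fibonacci_chain(members, comp[i], comp[j])
            if len(chain) < 3:   # the pair covers nothing
                continue
            score = len(chain) + len(covered(comp, chain))  # step (f)
            if score > best_score:
                best, best_score = chain, score
    return best, best_score


comp_b = [(0, 1), (1, -2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -1), (2, 1)]
cyclic_b, score = maximal_fibonacci(comp_b)
print(cyclic_b, score)
```

Run on the eight directions of Comp(b) from Section 7, the sketch selects the set starting with (0,1) and (1,-2), the same choice the example arrives at.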



5.3 Data Partitioning 

The next phase is data partitioning. The objective of the data distribution
is to achieve computation+communication load balance. This phase attempts
to minimize the communication overhead for larger, compute intensive
partitions by localizing their references as much as possible. In order to
determine the data partition, we apply the following simple algorithm which
uses a new larger partition owns rule:

— Sort the partitions in decreasing order of their sizes in terms of the
number of iterations. Visit the partitions in the sorted order (largest to
smallest). For each partition do:

— Find all the data references generated in the partition and allocate
that data to the respective processor. If a generated reference is already
owned by a larger partition visited previously, add it to the set of
non-local references.
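
The larger partition owns rule can be illustrated with a short sketch. It assumes that each partition is given as a list of iteration points and that every iteration references the array at a fixed list of constant offsets; the function name `larger_partition_owns` and the sample data are hypothetical.

```python
def larger_partition_owns(partitions, offsets):
    """Assign each referenced array element to the largest partition that touches it.

    partitions: dict mapping a partition name to its list of iteration points.
    offsets:    constant offsets of the references made in every iteration.
    Returns (owner, non_local), where owner maps an element to a partition and
    non_local maps a partition to the elements it references but does not own.
    """
    # Visit partitions from largest to smallest (sorted() is stable for ties).
    order = sorted(partitions, key=lambda p: len(partitions[p]), reverse=True)
    owner = {}
    non_local = {p: set() for p in partitions}
    for p in order:
        for point in partitions[p]:
            for off in offsets:
                elem = tuple(i + o for i, o in zip(point, off))
                # The first (largest) partition to reference an element owns it.
                if owner.setdefault(elem, p) != p:
                    non_local[p].add(elem)
    return owner, non_local


# Two small partitions; each iteration (i, j) reads b[i, j] and b[i-1, j].
parts = {"P0": [(1, 1), (1, 2), (1, 3)], "P1": [(2, 1), (2, 2)]}
owner, non_local = larger_partition_owns(parts, [(0, 0), (-1, 0)])
print(non_local)
```

In this toy input the larger partition P0 keeps all of its references local, while P1 must fetch the two elements of row 1 that P0 already owns.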



6. Partition Merging 

The next step in the compilation of DOALL loops is to schedule the
partitions generated above on the available number of processors. To schedule
the partitions generated by the iteration partitioning phase, a Partition
Interaction Graph is first constructed and granularity adjustment and load
balancing are carried out. Then the partitions are scheduled (mapped) on the
given number of available processors.

Each node of the partition interaction graph denotes one loop partition
and the weight of the node is equal to the number of iterations in that loop
partition. There is a directed edge from one node to another which represents
the direction of data communication. The weight of the edge is equal to the
number of data values being communicated. Let G(V, E) denote such a graph
where V is the set of nodes and E is the set of edges as described above. Let
$t(v_i)$ denote the weight of node $v_i$ and $c(v_j, v_i)$ denote the weight of edge
$(v_j, v_i) \in E$.

The following is assumed to be the order of execution of each partition:

— Send : The partition first sends the data needed by other partitions. 

— Receive : After sending the data, the partition receives the data it needs 
sent by other partitions. 

— Compute : After receiving the data in the above step, the partition executes 
the assigned loop iterations. 

The total time required for the execution of each partition is, thus, equal to
Send time + Receive time + Compute time. The send time is proportional to
the total number of data values sent out (total weight) on all outgoing edges
and the receive time is proportional to the total number of data values
received (total weight) on all incoming edges. The compute time is proportional
to the number of iterations (node weight).

Depending on the relative offsets of the reference instances between
different partitions and the underlying data distribution, the data values needed
by a given partition may be owned by one or more partitions. This
communication dependency is denoted by the graph edges, and the graph may contain
a different number of edges depending on such dependency. The length of
the longest path between $v_i$ and $v_j$ is defined as the communication distance
between $v_i$ and $v_j$, where $(v_i, v_j) \in E$. For example, in figure 6.1, the
communication distance for the edge $(v_k, v_i)$ is equal to two due to the fact that
$(v_k, v_i) \in E$ and the longest path from $v_k$ to $v_i$ is of length two.

It can be shown that due to the properties of the partitioning method
described in the last section (proof omitted due to lack of space), the following
relationships hold (refer to figure 6.1):

— The weight of a given node is less than or equal to that of any of its
predecessors. In other words, $t(v_i) \le t(v_j)$ where $(v_j, v_i) \in E$.

— The weight of an edge incident on a given node is more than or equal to
the weight of an outgoing edge from that node for the same communication
distance. In other words, $c(v_k, v_j) \ge c(v_j, v_i)$ where both the edges
represent the same communication distance. This relationship does not apply
to two edges representing two different communication distances.

We now describe three scheduling phases as outlined before. All of these 
heuristics traverse the partition interaction graph in reverse topological order 
by following simple breadth first rule as follows: 

— Visit the leaf nodes of the graph. 

— Visit the predecessor of a given node such that all of its successors are 
already visited. 

— Follow this procedure to visit backwards from leaf nodes till all the nodes 
including root node are visited. 
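
The traversal described by the three rules above can be sketched as a breadth-first walk that starts from the leaves and releases a node once all of its successors have been visited. The sketch assumes the partition interaction graph is acyclic and is given as a list of node names plus a list of directed edges; the function name and the three-node sample graph (modelled on the nodes named in the text for figure 6.1) are hypothetical.

```python
from collections import deque


def reverse_topological_order(nodes, edges):
    """Visit leaves first; visit a node only after all of its successors."""
    predecessors = {v: [] for v in nodes}
    pending_successors = {v: 0 for v in nodes}
    for src, dst in edges:
        predecessors[dst].append(src)
        pending_successors[src] += 1

    # Leaf nodes have no outgoing edges.
    frontier = deque(v for v in nodes if pending_successors[v] == 0)
    order = []
    while frontier:
        v = frontier.popleft()
        order.append(v)
        for u in predecessors[v]:
            pending_successors[u] -= 1
            if pending_successors[u] == 0:  # all successors of u are visited
                frontier.append(u)
    return order


# vk sends to vj and vi, vj sends to vi, so vi is the only leaf.
order = reverse_topological_order(
    ["vk", "vj", "vi"], [("vk", "vj"), ("vj", "vi"), ("vk", "vi")]
)
print(order)
```

In the sample graph the root vk is visited last, after both of its successors.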




Fig. 6.1. Portion of Partition Interaction Graph 



The complexity of each of these phases is 0(J|F ^ where V is the number 
of nodes in the graph. 

6.1 Granularity Adjustment

Refer to figure 6.1.

— Calculate the completion time of each node $v_j$, given by
$t_{com}(v_j) = k_1 \times \sum_{(v_j,v_i) \in E} c(v_j,v_i) + k_2 \times \sum_{(v_l,v_j) \in E} c(v_l,v_j) + t(v_j)$,
where the cost of one iteration is assumed to be 1, the cost of one send is
assumed to be $k_1$ and that of one receive to be $k_2$.

— Visit the nodes of the graph in the reverse topological order described
above. Suppose we choose a predecessor $v_k$ of node $v_j$ for merging to adjust
granularity.

— Determine the completion time of the merged node $v_{jk}$, given by $t_{com}(v_{jk}) =
t_{com}(v_j) + t_{com}(v_k) - c(v_j,v_k) \times (k_1 + k_2)$.

— Compare it with each of $t_{com}(v_j)$ and $t_{com}(v_k)$, and if $t_{com}(v_{jk})$ is less
than both, merge $v_j$ and $v_k$.

— Continue the process by attempting to expand the partition by considering
$v_{jk}$ and a predecessor of $v_k$ next, and so on.

— If $t_{com}(v_{jk})$ is greater than either of $t_{com}(v_j)$ or $t_{com}(v_k)$, reject the
merger of $v_j$ and $v_k$. Next, attempt the merger of $v_k$ and one of its
predecessors, and so on.

— Repeat all the steps again on the new graph resulting from the above
procedure and iterate until no new partitions are merged
together (condition of graph invariance).
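
The cost model and the merge test used in this phase can be sketched as below. The sketch assumes node weights are held in a dict `t`, edge weights in a dict `c` keyed by (source, destination), and that $k_1$ and $k_2$ are the per-value send and receive costs; the function names and sample numbers are hypothetical, and the subtraction term counts values exchanged between the two partitions in either direction.

```python
def completion_time(node, t, c, k1, k2):
    """t_com(v) = k1 * (values sent) + k2 * (values received) + iterations."""
    sent = sum(w for (src, _), w in c.items() if src == node)
    received = sum(w for (_, dst), w in c.items() if dst == node)
    return k1 * sent + k2 * received + t[node]


def merged_completion_time(vj, vk, t, c, k1, k2):
    """Completion time of the node obtained by merging vj with its predecessor vk."""
    # Values exchanged between the two partitions become local after the merge,
    # saving one send and one receive per value.
    shared = c.get((vk, vj), 0) + c.get((vj, vk), 0)
    return (completion_time(vj, t, c, k1, k2)
            + completion_time(vk, t, c, k1, k2)
            - shared * (k1 + k2))


def should_merge(vj, vk, t, c, k1, k2):
    """Merge only if the merged node finishes earlier than both originals."""
    merged = merged_completion_time(vj, vk, t, c, k1, k2)
    return (merged < completion_time(vj, t, c, k1, k2)
            and merged < completion_time(vk, t, c, k1, k2))


weights = {"vk": 2, "vj": 1}
heavy = {("vk", "vj"): 10}   # communication dominates computation
light = {("vk", "vj"): 1}    # computation dominates communication
print(should_merge("vj", "vk", weights, heavy, 1, 1),
      should_merge("vj", "vk", weights, light, 1, 1))
```

With the heavy edge the merged node completes in 3 time units against 11 and 12 for the separate nodes, so the merge is accepted; with the light edge the merged time is not smaller than both and the merge is rejected.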

6.2 Load Balancing

Refer to figure 6.1.

— Let T be the overall loop completion time generated by the above phase.

— Visit the nodes of the graph in the reverse topological order described
above. Suppose we choose a predecessor $v_k$ of node $v_j$ to merge the
partitions.

— Determine the completion time of the merged node: $t_{com}(v_{jk}) = t_{com}(v_j) +
t_{com}(v_k) - c(v_j,v_k) \times (k_1 + k_2)$. Obviously, $t_{com}(v_{jk})$ will be higher than
either $t_{com}(v_k)$ or $t_{com}(v_j)$, since if this were not the case, the two partitions
would have been merged by the granularity adjustment algorithm.

— Compare $t_{com}(v_{jk})$ with T, and if $t_{com}(v_{jk})$ is less than T, merge $v_j$ and
$v_k$.

— Continue the process by attempting to expand the partition by considering
$v_{jk}$ and a predecessor of $v_k$ next, and so on.

— If $t_{com}(v_{jk})$ is greater than T, reject the merger of $v_j$ and $v_k$. Next, attempt
the merger of $v_k$ and one of its predecessors, and so on.

— Keep repeating this process, and if at any stage the completion time of the
merged node is worse than the overall completion time T, reject it and
attempt a new one by considering the predecessor and its predecessor, and so
on.

6.3 Mapping

Refer to figure 6.1.

— Let there be P available processors on which the partitions resulting from
the previous phase are to be mapped, where the number of partitions is
greater than P.

— Traverse the graph in the reverse topological order as described earlier.
Suppose we choose a predecessor $v_k$ of a node $v_j$ for a possible merge to
reduce the number of processors.

— Determine the completion time of the merged node $v_{jk}$: $t_{com}(v_{jk}) = t_{com}(v_j) +
t_{com}(v_k) - c(v_j,v_k) \times (k_1 + k_2)$. Obviously, $t_{com}(v_{jk})$ will be higher than
the loop completion time T. Store $t_{com}(v_{jk})$ in a table.

— Attempt the merger of another node and its predecessor and store the result
in the table. Repeat this process for all the nodes, choose the pair which
results in the minimum completion time when merged, and combine them. This
reduces the number of partitions by 1.

— Continue the above process till the number of partitions is reduced to P.
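
A minimal sketch of the mapping loop follows. It assumes the same graph representation as a dict of node weights `t` and a dict of edge weights `c`, considers only node pairs joined by an edge (a node and one of its predecessors), and relabels a merged node by joining the two names with "+"; it does not check that merging keeps the graph acyclic. All names and the sample graph are hypothetical.

```python
def completion_time(node, t, c, k1, k2):
    sent = sum(w for (src, _), w in c.items() if src == node)
    received = sum(w for (_, dst), w in c.items() if dst == node)
    return k1 * sent + k2 * received + t[node]


def merge(t, c, vk, vj):
    """Return a new graph in which vk and vj are combined into one node."""
    merged = vk + "+" + vj
    t2 = {v: w for v, w in t.items() if v not in (vk, vj)}
    t2[merged] = t[vk] + t[vj]
    c2 = {}
    for (src, dst), w in c.items():
        src = merged if src in (vk, vj) else src
        dst = merged if dst in (vk, vj) else dst
        if src != dst:  # edges between vk and vj become internal and vanish
            c2[(src, dst)] = c2.get((src, dst), 0) + w
    return t2, c2


def map_to_processors(t, c, k1, k2, P):
    """Greedily merge the cheapest (predecessor, node) pair until P nodes remain."""
    t, c = dict(t), dict(c)
    while len(t) > P and c:
        def merged_time(edge):
            vk, vj = edge
            return (completion_time(vk, t, c, k1, k2)
                    + completion_time(vj, t, c, k1, k2)
                    - c[edge] * (k1 + k2))
        vk, vj = min(c, key=merged_time)  # the "table" lookup of the text
        t, c = merge(t, c, vk, vj)
    return t, c


nodes = {"a": 4, "b": 3, "c": 2}
edges = {("a", "b"): 2, ("b", "c"): 1}
mapped_nodes, mapped_edges = map_to_processors(nodes, edges, 1, 1, P=2)
print(mapped_nodes, mapped_edges)
```

Here merging b with c gives a completion time of 7 against 8 for merging a with b, so b and c are combined and two partitions remain for the two processors.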



7. Example : Texture Smoothing Code 

In this section, we illustrate the significance of the above phases using a 
template image processing code. This code exhibits a very high amount of 
spatial parallelism suitable for parallelization on distributed memory systems. 
On the other hand, this code also exhibits a high amount of communication 
in all possible directions in the iteration space. Thus, this code is a good 
example of the tradeoff between the parallelism and the communication. 

An important step in many image processing applications is texture
smoothing, which involves finding the average luminosity at a given point
in an image from its immediate and successive neighbors.

Consider the following code: 

for i := 1 to N
  for j := 1 to N
    a[i,j] := (b[i,j] + b[i-1,j-1] + b[i-1,j] + b[i-1,j+1]
               + b[i,j-1] + b[i,j+1]
               + b[i+1,j-1] + b[i+1,j] + b[i+1,j+1])/9
  endfor
endfor

The above code finds the average value of luminosity at a grid point (i,j) 
using its eight neighbors. In this code, every grid point is a potential candidate 
for parallelization; thus, the code exhibits a very high amount of parallelism. 
On the other hand, if we decide to parallelize every grid point, there would 
be a tremendous amount of communication in all possible directions. Thus, 
we apply our method to this application to demonstrate that we can achieve
a partition which maximally reduces communication while minimally reducing
the parallelism.

Let $b_1 \equiv b[i,j]$, $b_2 \equiv b[i-1,j-1]$, $b_3 \equiv b[i-1,j]$, $b_4 \equiv b[i-1,j+1]$,
$b_5 \equiv b[i,j-1]$, $b_6 \equiv b[i,j+1]$, $b_7 \equiv b[i+1,j-1]$, $b_8 \equiv b[i+1,j]$,
$b_9 \equiv b[i+1,j+1]$.
Thus, the instance set for the variable b is given by $\{b_1, b_2, \ldots, b_9\}$ for the nine
occurrences of b. Obviously, no communication free partition is possible for
the above set of references. The first step, therefore, is to determine maximal
compatibility subsets of the instance set.

In order to determine the maximal compatibility subsets, we follow the
algorithm described in section 5.1. We begin by considering the compatibility
subset involving $b_1$. We try to group $b_1$ with $b_2$ to create a subset $\{b_1, b_2\}$.
The direction for communication free partitioning for this subset is (1,1),
and thus we cannot add any other reference of b to this subset except $b_9$,
since adding any other reference would violate the condition for
communication free partitioning. Thus, one of our maximal compatibility subsets is
$\{b_1, b_2, b_9\}$. Next, we group $b_1$ with $b_3$ and add $b_8$ to it to give $\{b_1, b_3, b_8\}$ as
another compatibility subset with (1,0) as the direction of communication free
partitioning. Similarly, we try to group $b_1$ with other elements so that $b_1$ and
that element are not together in any subset formed so far. Thus, the other
subsets resulting from $b_1$ are $\{b_1, b_4, b_7\}$ and $\{b_1, b_5, b_6\}$ with (1,-1) and (0,1)
as the directions for communication free partitioning. Next, we follow the
algorithm for $b_2$. We already have $\{b_1, b_2\}$ in one of the subsets constructed
so far; thus, we start with $\{b_2, b_3\}$. The direction for communication free
partitioning is (0,1) in this case and we can include only $b_4$ in this subset. Thus,
we get $\{b_2, b_3, b_4\}$ as another maximal compatibility set.

By following the algorithm as illustrated above, the following are the
maximal compatibility subsets found (directions for communication free
partitions are shown next to each of them):

- $b_1$ : $\{b_1, b_2, b_9\}$ (1,1), $\{b_1, b_3, b_8\}$ (1,0), $\{b_1, b_4, b_7\}$ (1,-1), $\{b_1, b_5, b_6\}$ (0,1).

- $b_2$ : $\{b_2, b_3, b_4\}$ (0,1), $\{b_2, b_5, b_7\}$ (1,0), $\{b_2, b_6\}$ (1,2), $\{b_2, b_8\}$ (2,1).

- $b_3$ : $\{b_3, b_5\}$ (1,-1), $\{b_3, b_6\}$ (1,1), $\{b_3, b_7\}$ (2,-1), $\{b_3, b_9\}$ (2,1).

- $b_4$ : $\{b_4, b_5\}$ (1,-2), $\{b_4, b_6, b_9\}$ (1,0), $\{b_4, b_8\}$ (2,-1).

- $b_5$ : $\{b_5, b_8\}$ (1,1), $\{b_5, b_9\}$ (1,2).

- $b_6$ : $\{b_6, b_7\}$ (1,-2), $\{b_6, b_8\}$ (1,-1).

- $b_7$ : $\{b_7, b_8, b_9\}$ (0,1).

The next step is to determine the set Comp(b), which is a collection of
communication free directions corresponding to each one of the maximal
compatibility subsets. Thus, Comp(b) = {(0,1), (1,-2), (1,-1), (1,0), (1,1), (1,2),
(2,-1), (2,1)}. The next step is to determine Cyclic(b) to maximally cover
the directions in Comp(b). We, thus, apply the algorithm in section 5.2. We
begin by considering (0,1) and (1,-2), which add up to (1,-1). Thus, we include
(1,-1) in the set Fib being constructed. If we try adding (1,-2) and (1,-1), it
gives (2,-3), which is not a member of Comp(b). Thus, we stop, and at this
point Fib = {(0,1), (1,-2), (1,-1)}. The next step in the algorithm is to
determine the other directions in Comp(b) which are covered by this iteration
partition. Following step 3.e of the algorithm, if we add (1,-1) and (0,1)
it gives (1,0) and if we add 2*(1,-1) and (0,1) it gives (2,-1); thus, a linear
combination of (1,-1) and (0,1) covers two other directions in Comp(b). In this
case, we are able to eliminate communication equal to 5 by sequentializing 2
iterations.

Table 7.1. Fibonacci Sets Constructed by Algorithm in Section 5.2

Fibonacci Set (Fib)              Directions     Parallelism   Communication
                                 Covered        Reduced       Reduced
{(0,1), (1,-2), (1,-1)}          (1,0), (2,-1)  2             5
{(0,1), (1,-1), (1,0), (2,-1)}   -              3             4
{(0,1), (1,0), (1,1), (2,1)}     -              3             4
{(0,1), (1,1), (1,2)}            -              2             3
{(1,-2), (1,1), (2,-1)}          -              2             3
{(1,-1), (1,0), (2,-1)}          -              2             3
{(1,-1), (1,2), (2,1)}           -              2             3
{(1,0), (1,1), (2,1)}            -              2             3

Next, we try to construct another set Fib by starting with (0,1) and (1,-1)
and following the procedure of adding them and checking whether the sum covers
any direction in Comp(b). If it does, we add it to Fib and continue by forming
the Fibonacci series using the two most recently added elements in the set.
Finally, we find the covered directions from the remainder of Comp(b), if any,
using step 3.e of the algorithm. Table 7.1 shows the different Fib sets
constructed by the algorithm, the covered directions (if any), the parallelism
lost and the communication saved. From the table, one can see that the
algorithm would compute Cyclic(b) = {(0,1), (1,-2), (1,-1)}, since it results
in maximally reducing the communication. Thus, by cyclically following the
(0,1)/(1,-2) directions, one could reduce communication by 5 while losing
parallelism by 2 per basic iteration block. This demonstrates that it is
possible to profitably parallelize this type of application by using our
method.
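The chain-building part of this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the seed pairs are simply tried in sorted order, and the covering step (step 3.e) and the parallelism/communication accounting are not modelled. On the Comp(b) given above, the sketch yields the eight sets listed in the first column of Table 7.1.

```python
from itertools import combinations

# Comp(b) for the texture smoothing example, as listed in the text.
COMP = [(0, 1), (1, -2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -1), (2, 1)]


def fibonacci_chain(first, second, comp):
    """Extend [first, second] by repeatedly adding the two most recent
    directions, stopping as soon as the sum is not a member of comp."""
    members = set(comp)
    chain = [first, second]
    while True:
        a, b = chain[-2], chain[-1]
        nxt = (a[0] + b[0], a[1] + b[1])
        # Stop when the sum leaves Comp(b); the second test guards against
        # revisiting a direction already in the chain.
        if nxt not in members or nxt in chain:
            return chain
        chain.append(nxt)


def candidate_fib_sets(comp):
    """Try every pair of directions (in sorted order, an assumption of this
    sketch) as seeds and keep the chains that grew beyond the two seeds."""
    sets = []
    for first, second in combinations(sorted(comp), 2):
        chain = fibonacci_chain(first, second, comp)
        if len(chain) > 2:
            sets.append(chain)
    return sets


if __name__ == "__main__":
    for fib in candidate_fib_sets(COMP):
        print(fib)
```

Choosing among the candidates (the role of the last two columns of Table 7.1) is left to the algorithm of Section 5.2 and is deliberately not reproduced here.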

Once Cyclic(b) is determined, the next step is to generate the iteration
and the data partition for the loop nest. We apply the algorithm of Section
5.3. We first determine (1,7) as the base point and then apply (0,1)/(1,-2) as
directions cyclically to complete the partition. We then move along dimension
2 (for this example, the dimension involving ‘i’ is considered as dimension 1
and the dimension involving ‘j’ as dimension 2) and carry out the partitioning
in a similar manner. We get the iteration partitioning shown in Figure 7.1.
The data partition is found by traversing the iteration partitions in the
order 0, 4, 1, 5, 2, 6, 3 and 7 using the largest partition owns rule.

Finally, the partition interaction graph of the partitions is generated as
shown in Figure 7.2. For this graph, only the communication distances 1 and
2 exist between the different partitions (please see the definition of
communication distances in the preceding section). The total number of
iterations in each partition (the computation cost) and the number of data
values exchanged between two partitions (the communication cost) are shown
against each node and each edge in Figure 7.2. Depending on the relative costs
of computation and communication, the granularity adjustment, load balancing
and mapping phases will merge the partitions generated above. The results of
these phases for the Cray T3D for problem size N=16 are discussed in Section 8.




Fig. 7.1. Iteration partition for texture smoothing code 



8. Performance on Cray T3D 

The following example codes are used to test the method on a Cray T3D 
system with 32 processors: 

Example I:

    for i = 2 to N
      for j = 2 to N
        for k = 1 to Upper
          A[i,j,k] = B[i-2,j-1,k] + B[i-1,j-1,k] + B[i-1,j-2,k]
        endfor
      endfor
    endfor


Fig. 7.2. Partition Interaction Graph of Iteration Partitions


Example II:

    for i := 1 to N
      for j := 1 to N
        for k := 1 to Upper
          a[i,j,k] := (b[i,j,k]     + b[i-1,j-1,k] + b[i-1,j,k]
                     + b[i-1,j+1,k] + b[i,j-1,k]   + b[i,j+1,k]
                     + b[i+1,j-1,k] + b[i+1,j,k]   + b[i+1,j+1,k]) / 9
        endfor
      endfor
    endfor

In the above examples, there is an inner loop in the k dimension. The number
of iterations in this loop, Upper, is chosen as 10 million in the case of
Example I and 1 million in the case of Example II to make the computation in
the loop body comparable to communication. As one can see, there is no problem
in terms of communication free partitioning in the k dimension. However, in
the i and j dimensions, due to the reference patterns, no communication free
partition exists and thus the outer two loops (in the i and j dimensions) and
the underlying data are distributed by applying the techniques described in
Sections 5 and 6. For Example I, the cyclical directions of partitioning are
(0,1,0)/(-1,0,0) and for Example II, the cyclical directions are
(0,1,0)/(1,-2,0) as explained earlier. The number of partitions found by the
method for each of these examples is equal to N, the size of the problem.
Thus, the size of the problem N is appropriately chosen to match the number of
processors.

The method is partially implemented (some phases of the method are not fully
automated yet) in the back end of the Sisal (Streams and Iterations in a
Single Assignment Language) compiler, OSC [7,32], targeted for the Cray T3D
system. The method is tested for N=4 (4 processors), N=8 (8 processors), N=16
(16 processors) and N=32 (32 processors). The timings are obtained using the
clock() system call on the Cray T3D, which allows measuring timings in
microseconds. PVM was used as the underlying mode of communication. The
sequential (as shown above) and the parallel versions are implemented, and
speedup is calculated as the ratio of the sequential execution time to the
parallel execution time.



Table 8.1. Example I : Performance on Cray T3D

Problem   Processors   Direction   Sequential   Parallel     Speedup
Size                               Time (sec)   Time (sec)
4x4       4            Cyclic      15.8         7.6          2.08
4x4       4            (0,1)       15.8         15.1         1.05
4x4       4            (-1,0)      15.8         15.52        1.02
8x8       8            Cyclic      52.73        16.3         3.23
8x8       8            (0,1)       52.73        29.07        1.81
8x8       8            (-1,0)      52.73        30.09        1.75
16x16     16           Cyclic      213.9        33.66        6.35
16x16     16           (0,1)       213.9        61.5         3.47
16x16     16           (-1,0)      213.9        63.3         3.38
32x32     32           Cyclic      919.7        68.42        13.44
32x32     32           (0,1)       919.7        113.44       8.1
32x32     32           (-1,0)      919.7        117.6        7.82


Table 8.2. Example II : Performance on Cray T3D

Problem   Processors   Sequential   Parallel     Speedup
Size                   Time (sec)   Time (sec)
4x4       4            9.7          2.57         3.77
8x8       8            35.1         9.78         3.59
16x16     16           130.92       19.12        6.8
32x32     32           543.3        43.24        12.56


Refer to Tables 8.1 and 8.2 for the results for each example. It can be
clearly seen that it is possible to effectively parallelize both of these
examples, which are quite demanding in terms of communication, by employing
our method. The speedup values are quite promising in spite of the heavy
inherent communication in these applications.

We also implemented Example I using (0,1) (column-wise) and (-1,0)
(row-wise) as directions of partitioning using the ‘owner computes’ rule. The
speedups obtained by using these directions are also shown in Table 8.1. It
can be clearly seen that our method outperforms these partitionings by almost
a factor of 2 in terms of speedups.
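As a quick cross-check of the reported numbers, the following Python sketch recomputes speedup as sequential time divided by parallel time from the entries of Table 8.1, and compares the cyclic partitioning with the column-wise (0,1) partitioning. The data layout and function names are illustrative choices, not part of the original implementation.

```python
# (N, sequential seconds, {direction: parallel seconds}), copied from Table 8.1.
TABLE_8_1 = [
    (4, 15.8, {"cyclic": 7.6, "(0,1)": 15.1, "(-1,0)": 15.52}),
    (8, 52.73, {"cyclic": 16.3, "(0,1)": 29.07, "(-1,0)": 30.09}),
    (16, 213.9, {"cyclic": 33.66, "(0,1)": 61.5, "(-1,0)": 63.3}),
    (32, 919.7, {"cyclic": 68.42, "(0,1)": 113.44, "(-1,0)": 117.6}),
]


def speedup(sequential, parallel):
    """Speedup as the ratio of sequential to parallel execution time."""
    return sequential / parallel


def cyclic_advantage(parallel_times, other="(0,1)"):
    """How many times faster the cyclic partitioning runs than `other`."""
    return parallel_times[other] / parallel_times["cyclic"]


if __name__ == "__main__":
    for n, seq, par in TABLE_8_1:
        line = ", ".join(f"{d}: {speedup(seq, t):.2f}" for d, t in par.items())
        print(f"N={n:2d}  speedups -> {line}  "
              f"cyclic vs (0,1): {cyclic_advantage(par):.2f}x")
```

The ratio of the two speedups is the same as the ratio of the two parallel times, because the sequential time is shared by all rows for a given N.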




Fig. 8.1. Partition Interaction Graph for Example II (N=16) 



Figure 8.1 shows the partition interaction graph of Example II for a problem
size N=16. The first phase (granularity adjustment) attempts to increase the
granularity of the partitions by combining them as per the algorithm in
Section 6.1, but no partitions are combined by this phase. In order to measure
the performance after this stage, the partitions are mapped to the respective
processors. The processor completion times are shown in Figure 8.2.




Fig. 8.3. Completion times for processors after load balancing
(horizontal axis: processor number)



Fig. 8.4. Completion times for variable number of available processors :
P = 1 to P = 8 (horizontal axis: number of processors)



Next, the load balancing phase attempts to reduce the number of required
processors without increasing the completion time. The number of partitions
is reduced in this phase from 16 to 8. Figure 8.3 gives the completion times
of the respective processors. One can see that these processors are quite well
load balanced. Finally, the mapping phase attempts to map these 8 partitions
onto 8 or fewer processors. The completion times of these mappings for 8 down
to 1 processors are shown in Figure 8.4. One can see that the method
demonstrates excellent linear scalability.

8.1 Conclusions 

In this paper, we have presented a methodology for partitioning and
scheduling (mapping) DOALL loops in a communication efficient manner, with the
following contributions:

— Established a theoretical framework for communication efficient loop
partitioning applicable to a large class of practical DOALL loops.

— Developed an iteration partitioning method for these loops by determining
cyclic directions of partitioning in each dimension.

— Developed a new larger partition owns rule for data distribution for
computation+communication load balance.

— Developed methodologies for granularity adjustment, load balancing and
mapping to significantly improve execution time and
computation+communication load balance of each partition.

— Experimentally shown that these methods give good speedups for problems
that involve heavy inherent communication and also exhibit good load
balance and scalability.

The method can be used for effective parallelization of many practical loops 
encountered in important codes such as image processing, weather modeling 
etc. that have DOALL parallelism but which are inherently communication 
intensive. 



Summary. For distributed-memory multicomputers, the quality of the data
partitioning for a given application is crucial to obtaining high performance. This task
has traditionally been the user’s responsibility, but in recent years much effort has 
been directed to automating the selection of data partitioning schemes. Several re- 
searchers have proposed systems that are able to produce data distributions that 
remain in effect for the entire execution of an application. For complex programs, 
however, such static data distributions may be insufficient to obtain acceptable per- 
formance. The selection of distributions that dynamically change over the course 
of a program’s execution adds another dimension to the data partitioning problem. 
In this chapter we present an approach for selecting dynamic data distributions 
as well as a technique for analyzing the resulting data redistribution in order to 
generate efficient code. 



1. Introduction 

As part of the research performed in the PARADIGM (PARAllelizing compiler
for Distributed-memory General-purpose Multicomputers) project [4],
automatic data partitioning techniques have been developed to relieve the 
programmer of the burden of selecting good data distributions. Originally, 
the compiler could automatically select a static distribution of data (using a 
constraint-based algorithm [15]) specifying both the configuration of an ab- 
stract multi-dimensional mesh topology along with how program data should 
be distributed on the mesh. 

For complex programs, static data distributions may be insufficient to 
obtain acceptable performance on distributed-memory multicomputers. By 
allowing the data distribution to dynamically change over the course of a 
program’s execution this problem can be alleviated by matching the data 

fice of Naval Research Graduate Fellowship, and in part by the Advanced Research Projects
Agency under contract DAA-H04-94-G-0273 administered by the Army Research Office. We are
also grateful to the National Center for Supercomputing Applications and the San Diego Super- 
computing Center for providing access to their machines. 

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 445-484, 2001. 

[Figure 1.1 box labels: Sequential Fortran 77 program (no distribution or
redistribution); Optimized static HPF program (explicit distribution and
redistribution); Optimized dynamic HPF program (explicit redistribution)]

Fig. 1.1. Dynamic data partitioning framework 

distribution more closely to the different computations performed through- 
out the program. Such dynamic partitionings can yield higher performance 
than a static partitioning when the redistribution is more efficient than the 
communication pattern required by the statically partitioned computation. 
We have developed an approach [31] (which extends the static partitioning 
algorithm) for selecting dynamic data distributions as well as a technique for 
analyzing the resulting data redistribution [32] in order to generate efficient 
code. In this chapter we present an overview of these two techniques. 

The approach we have developed to automatically select dynamic distri- 
butions, shown in the light outlined region in Figure 1.1, consists of two main 
steps. The program is first recursively decomposed into a hierarchy of candi- 
date phases obtained using existing static distribution techniques. Then, the 
most efficient sequence of phases and phase transitions is selected taking into 
account the cost of redistributing the data between the different phases. 

An overview of the array redistribution data-flow analysis framework we 
have developed is shown in the shaded outlined areas of Figure 1.1. In addi- 
tion to serving as a back end to the automatic data partitioning system, the 
framework is also capable of analyzing (and optimizing) existing High Per- 
formance Fortran [26] (HPF) programs providing a mechanism to generate 
fully explicit dynamic HPF programs while optimizing the amount of data 
redistribution performed. 

The remainder of this chapter is organized as follows: related work is 
discussed in Section 2; our methodology for the selection of dynamic data 
distributions is presented in Section 3; Section 4 presents an overview of the
redistribution analysis framework and the representations used in its devel- 
opment; the techniques for performing interprocedural array redistribution 
analysis are presented in Section 5; results are presented in Section 6; and 
conclusions are presented in Section 7. 



2. Related Work 

Static Partitioning. Some of the ideas used in the static partitioning algo- 
rithm originally implemented in the PARADIGM compiler [17] were inspired 
by earlier work on multi-dimensional array alignment [29]. In addition to 
this work, in recent years much research has been focused on: performing 
multi-dimensional array alignment [5,8,25,29]; examining cases in which a 
communication-free partitioning exists [35]; showing how performance
estimation is a key in selecting good data distributions [11,44]; linearizing array
accesses and analyzing the resulting one-dimensional accesses [39]; applying 
iterative techniques which minimize the amount of communication at each 
step [2]; and examining issues for special-purpose distributed architectures 
such as systolic arrays [42]. 

Dynamic Partitioning. In addition to the work performed in static par- 
titioning, a number of researchers have also been examining the problem of 
dynamic partitioning. Hudak and Abraham have proposed a method for se- 
lecting redistribution points based on locating significant control flow changes 
in a program [22]. Chapman, Fahringer, and Zima describe the design of a 
distribution tool that makes use of performance prediction methods when 
possible but also uses empirical performance data through a pattern match- 
ing process [7]. Anderson and Lam [2] approach the dynamic partitioning 
problem using a heuristic which combines loop nests (with potentially dif- 
ferent distributions) in such a way that the largest potential communication 
costs are eliminated first while still maintaining sufficient parallelism. Bixby, 
Kennedy and Kremer [6,27], as well as Garcia, Ayguade, and Labarta [13], 
have formulated the dynamic data partitioning problem in the form of a 0-1 
integer programming problem by selecting a number of candidate distribu- 
tions for each of a set of given phases and constructing constraints from the 
data relations. More recently, Sheffler, Schreiber, Gilbert and Pugh [38] have
applied graph contraction methods to the dynamic alignment problem to 
reduce the size of the problem space that must be examined. 

Bixby, Kremer, and Kennedy have also described an operational defini- 
tion of a phase which defines a phase as the outermost loop of a loop nest 
such that the corresponding iteration variable is used in a subscript expres- 
sion of an array reference in the loop body [6]. Even though this definition 
restricts phase boundaries to loop structures and does not allow overlapping 
phases, for certain programs, such as the example that will be presented in 
Section 3.1, this definition is sufficient to describe the distinct phases of a
computation. 

Analyzing Dynamic Distributions. By allowing distributions to change
during the course of a program’s execution, more analysis must also be per- 
formed to determine which distributions are present at any given point in 
the program as well as to make sure redistribution is performed only when 
necessary in order to generate efficient code. 

The work by Hall, Hiranandani, Kennedy, and Tseng [18] defined the 
term reaching decompositions for the Fortran D [19] decompositions which 
reach a function call site. Their work describes extensions to the Fortran D 
compilation strategy using the reaching decompositions for a given call site to 
compile Fortran D programs that contain function calls as well as to optimize 
the resulting implicit redistribution. As presented, their techniques addressed 
computing and optimizing (redundant or loop invariant) implicit redistribu- 
tion operations due to changes in distribution at function boundaries, but do 
not address many of the situations which arise in HPF. 

The definition of reaching distributions (using HPF terminology), how- 
ever, is still a useful concept. We extend this definition to also include distri- 
butions which reach any point within a function in order to encompass both 
implicit and explicit distributions and redistributions thereby forming the 
basis of the work presented in this chapter. In addition to determining those 
distributions generated from a set of redistribution operations, this extended 
definition allows us to address a number of other applications in a unified 
framework. 

Work by Coelho and Ancourt [9] also describes an optimization for re- 
moving useless remappings specified by a programmer through explicit re- 
alignment and redistribution operations. In comparison to the work in the 
Fortran D project [18], they are also concerned with determining which dis- 
tributions are generated from a set of redistributions, but instead focus only 
on explicit redistribution. They define a new representation called a redistri- 
bution graph in which nodes represent redistribution operations and edges 
represent the statements executed between redistribution operations. This 
representation, although correct in its formulation, does not seem to fit well 
with any existing analysis already performed by optimizing compilers and 
also requires first summarizing all variables used or defined along every pos- 
sible path between successive redistribution operations in order to optimize 
redistribution. Even though their approach currently only performs this anal- 
ysis within a single function, they do suggest the possibility of an extension 
to their techniques which would allow them to also handle implicit remapping 
operations at function calls but they do not describe an approach. 



3. Dynamic Distribution Selection

For complex programs, we have seen that static data distributions may be 
insufficient to obtain acceptable performance. Static distributions suffer in 
that they cannot reflect changes in a program’s data access behavior. When 
conflicting data requirements are present, static partitionings tend to be com- 
promises between a number of preferred distributions. Instead of requiring 
a single data distribution for the entire execution, program data could also 
be redistributed dynamically for different phases of the program (where a 
phase is simply a sequence of statements over which a given distribution is 
unchanged). Such dynamic partitionings can yield higher performance than 
a static partitioning when the redistribution is more efficient than the com- 
munication pattern required by the statically partitioned computation. 



3.1 Motivation for Dynamic Distributions 

Figure 3.1 shows the basic computation performed in a two-dimensional Fast 
Fourier Transform (FFT). To execute this program in parallel on a machine 
with distributed memory, the main data array, Image, is partitioned across
the available processors. By examining the data accesses that will occur dur- 
ing execution, it can be seen that, for the first half of the program, data is 
manipulated along the rows of the array. For the rest of the execution, data is 
manipulated along the columns. Depending on how data is distributed among 
the processors, several different patterns of communication could be gener- 
ated. The goal of automatic data partitioning is to select the distribution 
that will result in the highest level of performance. 




Fig. 3.1. Two-dimensional Fast Fourier Transform 

If the array were distributed by rows, every processor could independently 
compute the FFTs for each row that involved local data. After the rows had 
been processed, the processors would now have to communicate to perform 
the column FFTs as the columns have been partitioned across the processors. 
Conversely, if a column distribution were selected, communication would be
required to compute the row FFTs while the column FFTs could be computed 
independently. Such static partitionings, as shown in Figure 3.1(a), suffer 
in that they cannot reflect changes in a program’s data access behavior. 
When conflicting data requirements are present, static partitionings tend to 
be compromises between a number of preferred distributions. 

For this example, assume the program is split into two separate phases; a 
row distribution is selected for the first phase and a column distribution for 
the second, as shown in Figure 3.1(b). By redistributing the data between the 
two phases, none of the one-dimensional FFT operations would require com- 
munication. From Figure 3.1, it can be seen how such a dynamic partitioning 
can yield higher performance if the dynamic redistribution communication is 
more efficient than the static communication pattern. 
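The two-phase structure can be illustrated with a small, purely sequential Python sketch. It is not the PARADIGM implementation: a naive O(n^2) DFT stands in for the FFT, a matrix transpose stands in for the row-to-column redistribution between the phases, and all function names are hypothetical. The point is only that each phase then operates on whole rows of its local layout, so the one-dimensional transforms themselves need no data from other rows.

```python
import cmath


def dft(vec):
    """Naive one-dimensional DFT (stand-in for a 1-D FFT)."""
    n = len(vec)
    return [
        sum(vec[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Stand-in for redistributing the array from rows to columns."""
    return [list(col) for col in zip(*matrix)]


def dft2_two_phase(image):
    # Phase 1 (row distribution): every row is transformed independently.
    rows_done = [dft(row) for row in image]
    # Redistribution between the phases: columns become locally held rows.
    by_columns = transpose(rows_done)
    # Phase 2 (column distribution): every column is transformed independently.
    cols_done = [dft(col) for col in by_columns]
    # Transpose back so the result is indexed like the input.
    return transpose(cols_done)


def dft2_direct(image):
    """Reference: direct evaluation of the two-dimensional DFT sum."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(
                image[x][y]
                * cmath.exp(-2j * cmath.pi * (u * x / rows + v * y / cols))
                for x in range(rows)
                for y in range(cols)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]


if __name__ == "__main__":
    img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(dft2_two_phase(img)[0][0])
```

Whether the dynamic scheme wins in practice depends on the cost of the redistribution step relative to the communication the static layout would need, which is what the cost models discussed below estimate.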



3.2 Overview of the Dynamic Distribution Approach 

As previously shown in Figure 1.1, the approach we have developed to
automatically select dynamic distributions consists of two main steps. First, in
Section 3.3, we will describe how to recursively decompose the program into 
a hierarchy of candidate phases obtained using existing static distribution 
techniques. Then, in Section 3.4 we will describe how to select the most ef- 
ficient sequence of phases and phase transitions taking into account the cost 
of redistributing the data between the different phases. 

This approach allows us to build upon the static partitioning tech- 
niques [15, 17] previously developed in the PARADIGM project. Static cost 
estimation techniques [16] are used to guide the selection of phases while 
static partitioning techniques are used to determine the best possible dis- 
tribution for each phase. The cost models used to estimate communication 
and computation costs use parameters, empirically measured for each target 
machine, to separate the partitioning algorithm from a specific architecture. 

To help illustrate the dynamic partitioning technique, an example program
will be used. In Figure 3.2, a two-dimensional Alternating Direction
Implicit iterative method¹ (ADI2D) is shown, which computes the solution
of an elliptic partial differential equation known as Poisson’s equation [14].
Poisson’s equation can be used to describe the dissipation of heat away from
a surface with a fixed temperature as well as to compute the free-space
potential created by a surface with an electrical charge.

For the program in Figure 3.2, a static data distribution will incur a
significant amount of communication for over half of the program’s execution.
For illustrative purposes only, the operational definition of phases previously
described in Section 2 identifies twelve different “phases” in the program.

¹ To simplify later analysis of performance measurements, the program shown
performs an arbitrary number of iterations as opposed to periodically checking
for convergence of the solution.

      program ADI2d
      double precision u(N,N), uh(N,N), b(N,N), alpha
      integer i, j, k

stmt                                                       op. phase
      *** Initial value for u
   1  do j = 1, N                                          I
   2    do i = 1, N
   3      u(i,j) = 0.0
   4    enddo
   5    u(1,j) = 30.0
   6    u(n,j) = 30.0
   7  enddo
      *** Initialize uh
   8  do j = 1, N                                          II
   9    do i = 1, N
  10      uh(i,j) = u(i,j)
  11    enddo
  12  enddo
  13  alpha = 4 * (2.0 / N)
  14  do k = 1, maxiter
      *** Forward and backward sweeps along cols
  15    do j = 2, N - 1                                    III
  16      do i = 2, N - 1
  17        b(i,j) = (2 + alpha)
  18        uh(i,j) = (alpha - 2) * u(i,j)
     &              + u(i,j + 1) + u(i,j - 1)
  19      enddo
  20    enddo
  21    do j = 2, N - 1                                    IV
  22      uh(2,j) = uh(2,j) + u(1,j)
  23      uh(N - 1,j) = uh(N - 1,j) + u(N,j)
  24    enddo
  25    do j = 2, N - 1                                    V
  26      do i = 3, N - 1
  27        b(i,j) = b(i,j) - 1 / b(i - 1,j)
  28        uh(i,j) = uh(i,j)
     &              + uh(i - 1,j) / b(i - 1,j)
  29      enddo
  30    enddo
  31    do j = 2, N - 1                                    VI
  32      uh(N - 1,j) = uh(N - 1,j) / b(N - 1,j)
  33    enddo
  34    do j = 2, N - 1                                    VII
  35      do i = N - 2, 2, -1
  36        uh(i,j) = (uh(i,j) + uh(i + 1,j))
     &              / b(i,j)
  37      enddo
  38    enddo
      *** Forward and backward sweeps along rows
  39    do j = 2, N - 1                                    VIII
  40      do i = 2, N - 1
  41        b(i,j) = (2 + alpha)
  42        u(i,j) = (alpha - 2) * uh(i,j)
     &             + uh(i + 1,j) + uh(i - 1,j)
  43      enddo
  44    enddo
  45    do i = 2, N - 1                                    IX
  46      u(i,2) = u(i,2) + uh(i,1)
  47      u(i,N - 1) = u(i,N - 1) + uh(i,N)
  48    enddo
  49    do j = 3, N - 1                                    X
  50      do i = 2, N - 1
  51        b(i,j) = b(i,j) - 1 / b(i,j - 1)
  52        u(i,j) = u(i,j)
     &             + u(i,j - 1) / b(i,j - 1)
  53      enddo
  54    enddo
  55    do i = 2, N - 1                                    XI
  56      u(i,N - 1) = u(i,N - 1) / b(i,N - 1)
  57    enddo
  58    do j = N - 2, 2, -1                                XII
  59      do i = 2, N - 1
  60        u(i,j) = (u(i,j) + u(i,j + 1))
     &             / b(i,j)
  61      enddo
  62    enddo
  63  enddo
  64  end

Fig. 3.2. 2-D Alternating Direction Implicit iterative method (ADI2D)
(Shown with operational phases)



These phases exposed by the operational definition need not be known for 
our technique (and, in general, are potentially too restrictive) but they will 
be used here for comparison as well as to facilitate the discussion. 

3.3 Phase Decomposition 

Initially, the entire program is viewed as a single phase for which a static 
distribution is determined. At this point, the immediate goal is to determine 
if and where it would be beneficial to split the program into two separate 
phases such that the sum of the execution times of the resulting phases is less 
than the original (as illustrated in Figure 3.3). Using the selected distribution, 
a communication graph is constructed to examine the cost of communication 
in relation to the flow of data within the program. 

We define a communication graph as the flow information from the de- 
pendence graph weighted by the cost of communication. The nodes of the 
communication graph correspond to individual statements while the edges 
correspond to flow dependencies that exist between the statements. As a 
heuristic, the cost of communication performed for a given reference in a 

Fig. 3.3. Phase decomposition

statement is initially assigned to (reflected back along) every incoming
dependence edge corresponding to the reference involved. Since flow information is
used to construct the communication graph, the weights on the edges serve 
to expose communication costs that exist between producer/consumer rela- 
tionships within a program. Also since we restrict the granularity of phase 
partitioning to the statement level, single node cycles in the flow dependence 
graph are not included in the communication graph. 

After the initial communication cost, comm(j, ref), has been computed
for a given reference, ref, and statement, j, it is scaled according to the
number of incoming edges for each producer statement, i, of the reference.
The weight of each edge, (i, j), for this reference, W(i, j, ref), is then
assigned this value:



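Scaling and assigning initial costs:

\[
W(i,j,\mathit{ref}) =
  \frac{\mathit{dyncount}(i)}{\sum_{P} \mathit{dyncount}(P)}
  \cdot \mathit{lratio}(i,j) \cdot \mathit{comm}(j,\mathit{ref}),
\qquad
P \in
\begin{cases}
  \mathit{in\_pred}(j,\mathit{ref}) & \text{if } i < j \\
  \mathit{in\_succ}(j,\mathit{ref}) & \text{if } i > j
\end{cases}
\tag{3.1}
\]

\[
\mathit{lratio}(i,j) =
  \frac{\mathit{nestlevel}(i) + 1}{\mathit{nestlevel}(j) + 1}
\tag{3.2}
\]
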



The scaling conserves the total cost of a communication operation for a given 
reference, ref , at the consumer, j, by assigning portions to each producer, 
i, proportional to the dynamic execution count of the given producer, i, di- 
vided by the dynamic execution counts of all producers. Note that the scaling 
factors are computed separately for producers which are lexical predecessors 
or successors of the consumer as shown in Equation (3.1). Also, to further 
differentiate between producers at different nesting levels, all scaling factors 
are also scaled by the ratio of the nesting levels as shown in Equation (3.2). 
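A minimal Python sketch of the scaling in Equations (3.1) and (3.2) may help make the conservation property concrete. It assumes that statements are identified by integers in lexical order and that the dynamic execution counts and nesting levels are supplied as dictionaries; these data structures and all function names are illustrative assumptions, not the PARADIGM compiler's interfaces.

```python
def lratio(nest_i, nest_j):
    """Ratio of nesting levels, as in Equation (3.2)."""
    return (nest_i + 1) / (nest_j + 1)


def initial_edge_weights(consumer, comm_cost, producers, dyncount, nestlevel):
    """Split comm(j, ref) over the incoming edges (i, j) as in Equation (3.1).

    consumer  -- statement j that performs the reference
    comm_cost -- comm(j, ref), the unscaled communication cost
    producers -- statements i with a flow dependence to j for this reference
    dyncount  -- {statement: dynamic execution count}
    nestlevel -- {statement: loop nesting level}
    """
    weights = {}
    # Lexical predecessors and successors are normalised separately.
    preds = [p for p in producers if p < consumer]
    succs = [p for p in producers if p > consumer]
    for group in (preds, succs):
        total = sum(dyncount[p] for p in group)
        if total == 0:
            continue  # nothing to distribute over this group
        for i in group:
            share = dyncount[i] / total
            weights[(i, consumer)] = (
                share * lratio(nestlevel[i], nestlevel[consumer]) * comm_cost
            )
    return weights


if __name__ == "__main__":
    w = initial_edge_weights(
        consumer=5,
        comm_cost=8.0,
        producers=[2, 3],
        dyncount={2: 100, 3: 300},
        nestlevel={2: 1, 3: 1, 5: 1},
    )
    print(w)
```

When all statements involved sit at the same nesting level, lratio is 1 and the weights of the predecessor edges add up to comm(j, ref) again, which is the conservation property described above.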

Once the individual edge costs have been scaled to conserve the total
communication cost, they are propagated back toward the start of the pro- 
gram (through all edges to producers which are lexical predecessors) while 
still conserving the propagated cost as shown in Equation (3.3). 

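Propagating costs back:

\[
W(i,j,\mathit{ref}) \mathrel{+}= \frac{\mathit{dyncount}(i)}{\sum_{P \in \mathit{in\_pred}(j,\ast)} \mathit{dyncount}(P)} \cdot \mathit{lratio}(i,j) \cdot \sum_{k} W(j,k,\ast)
\tag{3.3}
\]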

In Figure 3.4, the communication graph is shown for ADI2D with some
of the edges labeled with the unscaled comm cost expressions automatically
generated by the static cost estimator (using a problem size of 512 × 512
and maxiter set to 100). For reference, the communication models for an
Intel Paragon and a Thinking Machines CM-5, corresponding to the
communication primitives used in the cost expressions, are shown in Table 3.1.
Conditionals appearing in the cost expressions represent costs that will be
incurred based on specific distribution decisions (e.g., P_2 > 1 is true if the
second mesh dimension is assigned more than one processor).

Once the communication graph has been constructed, a split point is 
determined by computing a maximal cut of the communication graph. The 
maximal cut removes the largest communication constraints from a given 
phase to potentially allow better individual distributions to be selected for 
the two resulting split phases. Since we also want to ensure that the cut 
divides the program at exactly one point to ensure only two subphases are 
generated for the recursion, only cuts between two successive statements will 
be considered. Since the ordering of the nodes is related to the linear ordering 
of statements in a program, this guarantees that the nodes on one side of the 
cut will always all precede or all follow the node most closely involved in the 
cut. The following algorithm is used to determine which cut to use to split a 
given phase. 

For simplicity of this discussion, assume for now that there is at most
one edge between any two nodes. For multiple references to the same
array, the edge weight can be considered to be the sum of all communication
operations for that array. Also, to better describe the algorithm, view the
communication graph G = (V, E) in the form of an adjacency matrix (with
source vertices on rows and destination vertices on columns).

1. For each statement S_i, i ∈ [1, (|V| − 1)], compute the cut of the graph
between statements S_i and S_{i+1} by summing all the edges in the
submatrices specified by [S_1, S_i] × [S_{i+1}, S_|V|] and
[S_{i+1}, S_|V|] × [S_1, S_i].

2. While computing the cost of each cut also keep track of the current 
maximum cut. 

3. If there is more than one cut with the same maximum value, choose from 
this set the cut that separates the statements at the highest nesting level. 

(a) 100 (P_2 > 1) Shift(510)
(b) 3100 Transfer(510)

Fig. 3.4. ADI2D communication graph with example initial edge costs
(Statement numbers correspond to Figure 3.2)

Table 3.1. Communication primitives
(time in μs for an m byte message)

               Intel Paragon      TMC CM-5
Transfer(m)    50 + 0.018m        23 + 0.12m   (m ≤ 16)
                                  86 + 0.12m   (m > 16)
Shift(m)       2 × Transfer(m)    (both machines)

If there is more than one cut with the same highest nesting level, record
the earliest and latest maximum cuts with that nesting level (forming a 
cut window). 

4. Split the phase using the selected cut. 



(a) Adjacency matrix (b) Actual representation (c) Efficient computation
Fig. 3.5. Example graph illustrating the computation of a cut



In Figure 3.5, the computation of the maximal cut on a smaller example 
graph with arbitrary weights is shown. The maximal cut is found to be be- 
tween vertices 3 and 4 with a cost of 41. This is shown both in the form of 
the sum of the two adjacency submatrices in Figure 3.5(a), and graphically 
as a cut on the actual representation in Figure 3.5(b). 

In Figure 3.5(c), the cut is again illustrated using an adjacency matrix, 
but the computation is shown using a more efficient implementation which 
only adds and subtracts the differences between two successive cuts using 
a running cut total while searching for the maximum cut in sequence. This 
implementation also provides much better locality than the full submatrix 
summary when analyzing the actual sparse representation since the differ- 
ences between two successive cuts can be easily obtained by traversing the 
incoming and outgoing edge lists (which correspond to columns and rows in 
the adjacency matrix respectively) of the node immediately preceding the 
cut. This takes O(E) time on the actual representation, only visiting each
edge twice - once to add it and once to subtract it.
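
The running-total computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: statements are numbered 1..n in program order, edges are given as (source, destination, weight) triples, and ties are broken by keeping the earliest cut only (the nesting-level rule of step 3 is not modeled). The names `max_sequential_cut` and `naive_cut` are hypothetical.

```python
def naive_cut(i, edges):
    """Cut between statements i and i+1 by direct summation."""
    return sum(w for s, d, w in edges if min(s, d) <= i < max(s, d))


def max_sequential_cut(n, edges):
    """Return (i, value) for the largest cut between statements i and i+1.

    Each edge is touched twice: once when its earlier endpoint moves
    to the left of the cut (add) and once when its later endpoint
    does (subtract).
    """
    incident = [[] for _ in range(n + 1)]
    for src, dst, w in edges:
        if src != dst:  # single-node cycles are not part of the graph
            incident[src].append((dst, w))
            incident[dst].append((src, w))

    best_i, best_val, running = None, float("-inf"), 0
    for i in range(1, n):
        # Statement i has just moved to the left side of the cut.
        for other, w in incident[i]:
            if other > i:
                running += w  # edge now crosses the cut
            else:
                running -= w  # edge no longer crosses the cut
        if running > best_val:  # strict: keep the earliest maximum
            best_i, best_val = i, running
    return best_i, best_val


# Arbitrary example weights (not the graph of Figure 3.5).
edges = [(1, 2, 5), (1, 4, 3), (2, 3, 7), (3, 5, 10), (4, 3, 2), (5, 4, 6)]
print(max_sequential_cut(5, edges))  # (4, 16)
```

The successive cut values for this graph are 8, 10, 15, and 16, which agree with the direct summation in `naive_cut`.
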

A new distribution is now selected for each of the resulting phases while 
inheriting any unspecified distributions (due to an array not appearing in a 
subphase) from the parent phase. This process is then continued recursively 
using the costs from the newly selected distributions corresponding to each 
subphase. As was shown in Figure 3.3, each level of the recursion is carried out 
in branch and bound fashion such that a phase is split only if the sum of the 
estimated execution times of the two resulting phases shows an improvement 

Fig. 3.6. Partitioned communication graph for ADI2D
(Statement numbers correspond to Figure 3.2.) 



over the original.^ In Figure 3.6, the partitioned communication graph is 
shown for ADI2D after the phase decomposition is completed. 

As mentioned in the cut algorithm, it is also possible to find several cuts 
which all have the same maximum value and nesting level forming a window 
over which the cut can be performed. This can occur since not all statements 
will necessarily generate communication resulting in either edges with zero 
cost or regions over which the propagated costs conserve edge flow, both 
of which will maintain a constant cut value. To handle cut windows, the 
phase should be split into two subphases such that the lower subphase uses 
the earliest cut point and the upper subphase uses the latest, resulting in 
overlapping phases. After new distributions are selected for each overlapping 
subphase, the total cost of executing the overlapped region in each subphase 
is examined. The overlap is then assigned to the subphase that resulted in the 
lowest execution time for this region. If they are equal, the overlapping region 
can be equivalently assigned to either subphase. Currently, this technique is 
not yet implemented for cut windows. We instead always select the earliest 
cut point in a window for the partitioning. 

To be able to bound the depth of the recursion without ignoring im- 
portant phases and distributions, the static partitioner must also obey the 
following property. A partitioning technique is said to be monotonic if it se- 
lects the best available partition for a segment of code such that (aside from 
the cost of redistribution) the time to execute a code segment with a selected 
distribution is less than or equal to the time to execute the same segment 

^ A further optimization can also be applied to bound the size of the smallest 
phase that can be split by requiring its estimated execution time to be greater 
than a “minimum cost” of redistribution. 

with a distribution that is selected after another code segment is appended
to the first. In practice, this condition is satisfied by the static partitioning 
algorithm that we are using. This can be attributed to the fact that conflicts 
between distribution preferences are not broken arbitrarily, but are resolved 
based on the costs imposed by the target architecture [17]. 

It is also interesting to note that if a cut occurs within a loop body, 
and loop distribution can be performed, the amount of redistribution can be 
greatly reduced by lifting it out of the distributed loop body and performing it 
in between the two sections of the loop. Also, if dependencies allow statements 
to be reordered, statements may be able to move across a cut boundary 
without affecting the cost of the cut while possibly reducing the amount of 
data to be redistributed. Both of these optimizations can be used to reduce 
the cost of redistribution but neither will be examined in this chapter. 

3.4 Phase and Phase Transition Selection 

After the program has been recursively decomposed into a hierarchy of 
phases, a Phase Transition Graph (PTG) is constructed. Nodes in the PTG 
are phases resulting from the decomposition while edges represent possible 
redistribution between phases as shown in Figure 3.7(a). Since it is possible 
that using lower level phases may require transitioning through distributions 
found at higher levels (to keep the overall redistribution costs to a minimum),
the phase transition graph is first sectioned across phases at the granularity 
of the lowest level of the phase decomposition.^ Redistribution costs are then 
estimated for each edge and are weighted by the dynamic execution count of 
the surrounding code. 

If a redistribution edge occurs within a loop structure, additional redis- 
tribution may be induced due to the control flow of the loop. To account for 
a potential “reverse” redistribution which can occur on the back edge of the 
iteration, the phase transition graph is also sectioned around such loops. The 
first iteration of a loop containing a phase transition is then peeled off and 
the phases of the first iteration of the body re-inserted in the phase transition 
graph as shown in Figure 3.7(b). Redistribution within the peeled iteration 
is only executed once while that within the remaining loop iterations is now 
executed (N − 1) times, where N is the number of iterations in the loop. The
redistribution, which may occur between the first peeled iteration and the
remaining iterations, is also multiplied by (N − 1) in order to model when
the back edge causes redistribution (i.e., when the last phase of the peeled 
iteration has a different distribution than the first phase of the remaining 
ones). 

Once costs have been assigned to all redistribution edges, the best se- 
quence of phases and phase transitions is selected by computing the shortest 

^ Sectioned phases that have identical distributions within the same horizontal
section of the PTG are actually now redundant and can be removed, if desired,
without affecting the quality of the final solution.

Table 3.2. Detected phases and estimated execution times (sec) for ADI2D
(Performance estimates correspond to 32 processors.)

          Op. Phase(s)   Distribution        Intel Paragon   TMC CM-5
Level 0   I-XII          *,BLOCK   1 × 32    22.151461       39.496276
Level 1   I-VIII         *,BLOCK   1 × 32     1.403644        2.345815
          IX-XII         BLOCK,*   32 × 1     0.602592        0.941550
Level 2   I-III          BLOCK,*   32 × 1     0.376036        0.587556
          IV-VIII        *,BLOCK   1 × 32     0.977952        1.528050



path on the phase transition graph. This is accomplished in O(V^2) time
(where V is now the number of vertices in the phase transition graph) using
Dijkstra’s single source shortest path algorithm [40].
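
To make the selection step concrete, the sketch below runs Dijkstra's algorithm over a tiny phase transition graph in Python. It is an illustration only: the graph, the phase names, and all costs are hypothetical, and the code uses a binary heap (`heapq`) rather than the O(V^2) array formulation mentioned above. Node costs stand for estimated phase execution times and edge costs for redistribution between phases.

```python
import heapq


def shortest_phase_path(phase_cost, redist_cost, start, stop):
    """Return (total_cost, path) for the cheapest start-to-stop sequence.

    phase_cost  -- dict: node -> execution cost of that phase
    redist_cost -- dict: node -> {successor: redistribution cost}
    All costs are assumed to be non-negative.
    """
    dist = {start: phase_cost[start]}
    prev = {}
    heap = [(dist[start], start)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == stop:
            break
        for v, w in redist_cost.get(u, {}).items():
            nd = d + w + phase_cost[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if stop not in dist:
        return float("inf"), []
    path = [stop]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[stop], path[::-1]


# Two program sections (A, B), each with a row and a column variant.
phase_cost = {"start": 0, "A_row": 4, "A_col": 9,
              "B_row": 8, "B_col": 3, "stop": 0}
redist_cost = {
    "start": {"A_row": 0, "A_col": 0},
    "A_row": {"B_row": 0, "B_col": 2},  # changing distribution costs 2
    "A_col": {"B_row": 2, "B_col": 0},
    "B_row": {"stop": 0},
    "B_col": {"stop": 0},
}
print(shortest_phase_path(phase_cost, redist_cost, "start", "stop"))
# (9, ['start', 'A_row', 'B_col', 'stop'])
```

In this made-up instance, either static choice costs 12, while paying 2 for one redistribution yields a total of 9, so the dynamic sequence is selected.
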

After the shortest path has been computed, the loop peeling performed 
on the PTG can be seen to have been necessary to obtain the best solution 
if the peeled iteration has a different transition sequence than the remaining 
iterations. Even if the peeled iteration does have different transitions, not 
actually performing loop peeling on the actual code will only incur at most 
one additional redistribution stage upon entry to the loop nest. This will 
not overly affect performance if the execution of the entire loop nest takes 
significantly longer than a single redistribution operation, which is usually the 
case especially if the redistribution considered within the loop was actually 
accepted when computing the shortest path. 

Using the cost models for an Intel Paragon and a Thinking Machines 
CM-5, the distributions and estimated execution times reported by the static 
partitioner for the resulting phases (described as ranges of operational phases) 
are shown in Table 3.2. The performance parameters of the two machines are 
similar enough that the static partitioning actually selects the same distri- 
bution at each phase for each machine. The times estimated for the static 
partitioning are slightly higher than those actually observed, resulting from 
a conservative assumption regarding pipelines^ made by the static cost es- 
timator [15], but they still exhibit similar enough performance trends to be 
used as estimates. For both machines, the cost of performing redistribution is 
low enough in comparison to the estimated performance gains that a dynamic 
distribution scheme is selected, as shown by the shaded area in Figure 3.7(b). 

Pseudo-code for the dynamic partitioning algorithm is presented in Fig- 
ure 3.8 to briefly summarize both the phase decomposition and the phase 



Initially, a BLOCK, BLOCK distribution was selected by the static partitioner for 
(only) the first step of the phase decomposition. As the static performance 
estimation framework does not currently take into account any overlap between 
communication and computation for pipelined computations, we decided that 
this decision was due to the conservative performance estimate. For the analysis 
presented for ADI2D, we bypassed this problem by temporarily restricting the 
partitioner to only consider 1-D distributions. 




460 Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee 



transition selection procedures as described. As distributions for a given phase
are represented as a set of variables, each of which has an associated
distribution, a masking union set operation is used to inherit unspecified
distributions (the masking union of dist and dist_i). A given variable’s
distribution in the dist set will be replaced if it also has a distribution in
the dist_i set, thus allowing any unspecified distributions in subphase i
(dist_i) to be inherited from its parent (dist). These sets and the operations
performed on them will be described in more detail in Section 4.3.

Since the use of inheritance during the phase decomposition process im- 
plicitly maintains the coupling between individual array distributions, redis- 
tribution at any stage will only affect the next stage. This can be contrasted 
to the technique proposed by Bixby, Kennedy, and Kremer [6] which first 
selects a number of partial candidate distributions for each phase specified 
by the operational definition. Since their phase boundaries are chosen in the 
absence of flow information, redistribution can affect stages at any distance 
from the current stage. This causes the redistribution costs to become binary 
functions depending on whether or not a specific path is taken, thereby
necessitating 0-1 integer programming. In [27] they do agree, however,
that 0-1 integer programming is not necessary when all phases specify
complete distributions (such as in our case). In their work, this occurs only 
as a special case in which they specify complete phases from the innermost to 
outermost levels of a loop nest. For this situation they show how the solution 
can be obtained using a hierarchy of single source shortest path problems in 
a bottom-up fashion (as opposed to solving only one shortest path problem 
after performing a top-down phase decomposition as in our approach). 

Up until now, we have not described how to handle control flow other than 
for loop constructs. More general flow (caused by conditionals or branch oper- 
ations) can be viewed as separate paths of execution with different frequencies 
of execution. The same techniques that have been used for scheduling assem- 
bly level instructions by selecting traces of interest [12] or forming larger 
blocks from sequences of basic blocks [23] in order to optimize the most fre- 
quently taken paths can also be applied to the phase transition problem. 
Once a single trace has been selected (using profiling or other criteria) its 
phases are obtained using the phase selection algorithm previously described 
but ignoring all code off the main trace. Once phases have been selected, 
all off-trace paths can be optimized separately by first setting their stop
and start nodes to the distributions of the phases selected for the points at
which they exit and re-enter the main trace. Each off-trace path can then
be assigned phases by applying the phase selection algorithm to each path
individually. This specific technique is not currently implemented in the
compiler and will be addressed in future work, but other researchers have
also been considering it as a feasible solution for selecting phase transitions
in the presence of general control flow [3].

PARTITION(program)
    cutlist ← ∅
    dist ← STATIC-PARTITIONING(program)
    phases ← DECOMPOSE-PHASE(program, dist, cutlist)
    ptg ← SELECT-REDISTRIBUTION(phases, cutlist)
    Assign distributions based on shortest path recorded in ptg

DECOMPOSE-PHASE(phase, dist, cutlist)
    Add phase to list of recognized phases
    Construct the communication graph for the phase
    cut ← MAX-CUT(phase)
    if VALUE(cut) = 0                    ▷ No communication in phase
        then return
    Relocate cut to highest nesting level of identical cuts
    phase_1, phase_2 ← phase
    ▷ Note: if cut is a window, phase_1 and phase_2 will overlap
    dist_1 ← STATIC-PARTITIONING(phase_1)
    dist_2 ← STATIC-PARTITIONING(phase_2)
    ▷ Inherit any unspecified distributions from parent
    dist_1 ← MASKING-UNION(dist, dist_1)
    dist_2 ← MASKING-UNION(dist, dist_2)
    if (cost(phase_1) + cost(phase_2)) < cost(phase)
        then ▷ If cut is a window, phase_1 and phase_2 overlap
             if LAST_STMTNUM(phase_1) > FIRST_STMTNUM(phase_2)
                 then RESOLVE-OVERLAP(cut, phase_1, phase_2)
             LIST-INSERT(cut, cutlist)
             phase→left = DECOMPOSE-PHASE(phase_1, dist_1, cutlist)
             phase→right = DECOMPOSE-PHASE(phase_2, dist_2, cutlist)
        else phase→left = NULL
             phase→right = NULL
    return (phase)

SELECT-REDISTRIBUTION(phases, cutlist)
    if cutlist = ∅
        then return
    ptg ← CONSTRUCT-PTG(phases, cutlist)
    Divide ptg horizontally at the lowest recursion level
    for each loop in phases
        do if loop contains a cut at its nesting level
               then Divide ptg at loop boundaries
                    PEEL(loop, ptg)
    Estimate the interphase redistribution costs for ptg
    Compute the shortest phase transition path on ptg
    return (ptg)

Fig. 3.8. Pseudo-code for the partitioning algorithm

4. Data Redistribution Analysis

The intermediate form of a program within the framework, previously shown 
in Figure 1.1, specifies both the distribution of every array at every point in 
the program as well as the redistribution required to move from one point 
to the next. The different paths through the framework involve passes which 
process the available distribution information in order to obtain the missing 
information required to move from one representation to another. 

The core of the redistribution analysis portion of the framework is built 
upon two separate interprocedural data-flow problems which perform distri- 
bution synthesis and redistribution synthesis (which will be described later in 
Section 5). These two data-flow problems are both based upon the problem 
of determining both the inter- and intraprocedural reaching distributions for 
a program. Before giving further details of how these transformations are 
accomplished through the use of these two data-flow problems, we will first 
describe the idea of reaching distributions and the basic representations we 
use to perform this analysis. 

4.1 Reaching Distributions and the Distribution Flow Graph 

The problem of determining which distributions reach any given point taking 
into account control flow in the program is very similar to the computation 
of reaching definitions [1]. In classic compilation theory a control flow graph
consists of nodes (basic blocks) representing uninterrupted sequences of state- 
ments and edges representing the flow of control between basic blocks. For 
determining reaching distributions, an additional restriction must be added 
to this definition. Not only should each block B be viewed as a sequence of 
statements with flow only entering at the beginning and leaving at the end, 
but the data distribution for the arrays defined or used within the block is 
also not allowed to change. In comparison to the original definition of a basic 
block, this imposes tighter restrictions on the extents of a block. Using this 
definition of a block in place of a basic block results in what we refer to as 
the distribution flow graph (DFG). This representation differs from [9] as 
redistribution operations now merely augment the definition of basic block 
boundaries as opposed to forming the nodes of the graph. 

Since the definition of the DFG is based upon the CFG, the CFG can be 
easily transformed into a DFG by splitting basic blocks at points at which a 
distribution changes as shown in Figure 4.1. This can be due to an explicit 
change in distribution, as specified by the automatic data partitioner, or by 
an actual HPF redistribution directive. If the change in distribution is due 
to a sequence of redistribution directives, the overall effect is assigned to 
the block in which they are contained; otherwise, a separate block is created 
whenever executable operations are interspersed between the directives. 
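
One way to read this splitting rule is sketched below in Python. The statement encoding (a list of `("redist", text)` and `("exec", text)` pairs) and the function name `split_block` are assumptions made for illustration; the sketch only shows where block boundaries would fall, not how the framework represents them.

```python
def split_block(stmts):
    """Split one basic block into DFG blocks.

    stmts -- list of (kind, text) pairs, kind is "redist" or "exec".
    A run of consecutive redistribution directives stays together at
    the head of a block; a directive that follows an executable
    statement starts a new block.
    """
    blocks, current, seen_exec = [], [], False
    for kind, text in stmts:
        if kind == "redist" and seen_exec:
            blocks.append(current)
            current, seen_exec = [], False
        if kind == "exec":
            seen_exec = True
        current.append((kind, text))
    if current:
        blocks.append(current)
    return blocks


basic_block = [
    ("exec", "s1"),
    ("redist", "r1"), ("redist", "r2"),  # sequence: one combined effect
    ("exec", "s2"),
    ("redist", "r3"),                    # separated from r2 by s2
    ("exec", "s3"),
]
for block in split_block(basic_block):
    print([text for _, text in block])
# ['s1']
# ['r1', 'r2', 's2']
# ['r3', 's3']
```
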

(a) Distribution split (b) Redistribution split

Fig. 4.1. Splitting CFG nodes to obtain DFG nodes 

4.2 Computing Reaching Distributions 

Using this view of a block in a DFG and by viewing array distributions 
as definitions, the same data-flow framework used for computing reaching 
definitions [1] can now be used to obtain the reaching distributions by defining 
the following sets for each block B in a function:

• DIST(B) - distributions present when executing block B
• REDIST(B) - redistributions performed upon entering block B
• GEN(B) - distributions generated by executing block B
• KILL(B) - distributions killed by executing block B
• IN(B) - distributions that exist upon entering block B
• OUT(B) - distributions that exist upon leaving block B
• DEF(B), USE(B) - variables defined or used in block B

It is important to note that GEN and KILL are specified as the distri- 
butions generated or killed by executing block B as opposed to entering (re- 
distribution at the head of the block) or exiting (redistribution at the tail of 
the block) in order to allow both forms of redistribution. GEN and KILL are 
initialized by DIST or REDIST (depending on the current application as will 
be described in Section 5) and may be used to keep track of redistributions 
that occur on entry (e.g., HPF redistribution directives or functions with 
prescriptive distributions) or exit (e.g., calls to functions which internally 
change a distribution before returning). To perform interprocedural analysis, 
the function itself also has IN and OUT sets, which contain the distributions 
present upon entry and summarize the distributions for all possible exits. 

Once the sets have been defined, the following data-flow equations are
iteratively computed for each block until the solution OUT(B) converges for
every block B (where PRED(B) are the nodes which immediately precede B
in the flow of the program):

\[
\mathrm{IN}(B) = \bigcup_{P \in \mathrm{PRED}(B)} \mathrm{OUT}(P)
\tag{4.1}
\]

\[
\mathrm{OUT}(B) = \mathrm{GEN}(B) \cup \bigl(\mathrm{IN}(B) - \mathrm{KILL}(B)\bigr)
\tag{4.2}
\]

Since the confluence operator is a union, both IN and OUT never decrease
in size and the algorithm will eventually halt. By processing the blocks in the 
flow graph in a depth-first order, the number of iterations performed will 
roughly correspond to the level of the most deeply nested statement, which 
tends to be a fairly small number on real programs [1]. 
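
The iteration of Eqs. (4.1) and (4.2) can be sketched in Python with ordinary sets. Here a distribution is modeled as an (array, distribution) pair, and the small flow graph, block names, and distribution strings are hypothetical; the framework itself uses the bit-vector sets of Section 4.3 rather than Python sets.

```python
def reaching_distributions(blocks, preds, gen, kill):
    """Iterate Eqs. (4.1) and (4.2) until OUT stops changing.

    blocks -- block ids, ideally listed in depth-first order
    preds  -- dict: block -> list of immediate predecessors
    gen    -- dict: block -> set of distributions generated
    kill   -- dict: block -> set of distributions killed
    """
    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # Eq. (4.1): union over all predecessors.
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            # Eq. (4.2): generated, plus whatever survives the kill set.
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT


row = ("A", "BLOCK,*")
col = ("A", "*,BLOCK")
blocks = ["B1", "B2", "B3", "B4"]
preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"], "B4": ["B3"]}
gen = {"B1": {row}, "B2": set(), "B3": {col}, "B4": set()}
kill = {"B1": set(), "B2": set(), "B3": {row}, "B4": set()}

IN, OUT = reaching_distributions(blocks, preds, gen, kill)
# Both distributions of A reach B2 (loop back edge from B3);
# only the column distribution reaches B4.
```
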

As can be seen from Eqs. (4.1) and (4.2), the DEF and USE sets are
actually not used to compute reaching distributions, but will have other uses
for optimizing redistribution which will be explained in more detail in
Section 5.2.

4.3 Representing Distribution Sets 

To represent a distribution set in a manner that would provide efficient set 
operations, the bulk of the distribution information associated with a given 
variable is stored in its symbol table entry as a distribution table. Bit vec- 
tors are used within the sets themselves to specify distributions which are 
currently active for a given variable. Since a separate symbol table entry is 
created for each variable within a given scope, this provides a clean interface 
for accepting distributions from the HPF front end [21]. While the front end is 
processing HPF distribution or redistribution directives, any new distribution 
information present for a given variable is simply added to the corresponding 
distribution table for later analysis. 




Fig. 4.2. Distribution set using bit vectors



As shown in Figure 4.2, the actual distribution sets are maintained as 
linked lists with a separate node representing each variable with a bit vector 
(corresponding to the entries in the distribution table for that variable) to 
indicate which distributions are currently active for the variable. To main- 
tain efficiency while still retaining the simplicity of a list, the list is always 
maintained in sorted order by the address of the variable’s entry in the sym- 
bol table to facilitate operations between sets. This allows us to implement 
operations on two sets by merging them in only O(n) bit vector operations
(where n is the number of variables in a given set).

Since these sets are now actually sets of variables which each contain a
set representing their active distributions, $\mathrm{SET}_{var}$ will be used to
specify the variables present in a given distribution set, SET. For example, the
notation $\overline{\mathrm{SET}}_{var}$ can be used to indicate the inverse of the
distributions for each variable contained within the set as opposed to an
inverse over the universe of all active variables (which would be indicated as
$\overline{\mathrm{SET}}$).

In addition to providing full union, intersection, and difference operations
($\cup$, $\cap$, $-$) which operate on both levels of the set representation
(between the variable symbols in the sets as well as between the bit vectors of
identical symbols), masking versions of these operations are also provided
which operate at only the symbol level. In the case of a masking union of two
sets a and b, a union is performed at the symbol level such that any
distributions for a variable appearing in set a will be replaced by
distributions in set b. This allows new distributions in b to be added to a set
while replacing any existing distributions in a. Masking intersections and
masking differences of a and b act somewhat differently in that the variables
in set a are either selected or removed (respectively) by their appearance in
set b. These two operations are useful for implementing existence operations.

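
A compact way to picture the two-level representation is a list of (variable key, bit vector) pairs kept sorted by key, with every set operation expressed as a single merge pass. The Python sketch below follows that idea; the integer keys stand in for symbol table addresses, Python integers stand in for bit vectors, and all function names are hypothetical.

```python
def _merge(a, b, both, only_a, only_b):
    """Walk two key-sorted lists once, in O(len(a) + len(b)) steps."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i][0] < b[j][0]):
            item = only_a(a[i]); i += 1
        elif i == len(a) or b[j][0] < a[i][0]:
            item = only_b(b[j]); j += 1
        else:  # same variable in both sets
            item = both(a[i], b[j]); i += 1; j += 1
        if item is not None:
            out.append(item)
    return out


def keep(x):
    return x


def drop(x):
    return None


def full_union(a, b):
    """Union at both levels: variables and their bit vectors."""
    return _merge(a, b, lambda x, y: (x[0], x[1] | y[1]), keep, keep)


def masking_union(a, b):
    """Symbol-level union: entries of b replace those of a."""
    return _merge(a, b, lambda x, y: y, keep, keep)


def masking_intersection(a, b):
    """Keep the entries of a whose variable also appears in b."""
    return _merge(a, b, lambda x, y: x, drop, drop)


def masking_difference(a, b):
    """Keep the entries of a whose variable does not appear in b."""
    return _merge(a, b, lambda x, y: None, keep, drop)


parent = [(1, 0b01), (3, 0b10)]    # distributions of the parent phase
subphase = [(3, 0b01), (4, 0b11)]  # distributions chosen for a subphase
print(masking_union(parent, subphase))  # [(1, 1), (3, 1), (4, 3)]
```

In the last line, variable 1 is inherited from the parent because the subphase does not specify it, while variable 3 takes the subphase's distribution.
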

5. Interprocedural Redistribution Analysis 

Since the semantics of HPF require that all objects accessible to the caller 
after the call are distributed exactly as they were before the call [26], it is 
possible to first completely examine the context of a call before considering 
any distribution side effects due to the call. It may seem strange to say 
that there can be side effects when we just said that the semantics of HPF 
preclude it. To clarify this statement, such side effects are allowed to exist, 
but only to the extent that they are not apparent outside of the call. As 
long as the view specified by the programmer is maintained, the compiler is 
allowed to do whatever it can to optimize both the inter- and intraprocedural 
redistributions so long as the resulting distributions used at any given point 
in the program are not changed. 

The data partitioner explicitly assigns different distributions to individual 
blocks of code serving as an automated mechanism for converting sequential 
Fortran programs into efficient HPF programs. In this case, the framework 
is used to synthesize explicit redistribution operations in order to preserve 
the meaning of what the data partitioner intended in the presence of HPF 
semantics. In HPF, on the other hand, dynamic distributions are described 
by specifying the transitions between different distributions (through explicit 
redistribution directives or implicit redistribution at function boundaries). 
With the framework it is possible to convert an arbitrary HPF program into 
an optimized HPF program containing only explicit redistribution directives 
and descriptive [26] function arguments. 

(a) Pre-order traversal for calling context (b) Post-order traversal for side-effects

Fig. 5.1. Example call graph and depth-first traversal order

In Figure 5.1, an example call graph is shown to help illustrate the dis- 
tribution and redistribution synthesis phases of the interprocedural analysis. 
If distribution information is not present (i.e., HPF input), distribution syn- 
thesis is first performed in a top-down manner over a program’s call graph 
to compute which distributions are present at every point within a given 
function. By establishing the distributions that are present at each call site, 
clones are also generated for each unique set of input distributions obtained 
for the called functions. Redistribution synthesis is then applied in a bottom- 
up manner over the expanded call graph to analyze where the distributions 
are actually used and generates the redistribution required within a function. 

Since this analysis is interested in the effects between an individual 
caller/callee pair, and not in summarizing the effects from all callers be- 
fore examining a callee, it is not necessary to perform a topological traversal 
for the top-down and bottom-up passes over the call graph. In this case, it 
is actually more intuitive to perform a depth-first pre-order traversal of the 
call graph (shown in Figure 5.1(a)) to fully analyze a given function before 
proceeding to analyze any of the functions it calls and to perform a depth- 
first post-order traversal (shown in Figure 5.1(b)) to fully analyze all called 
functions before analyzing the caller. 

One other point to emphasize is that these interprocedural techniques 
can be much more efficient than analyzing a fully inlined version of the same 
program since it is possible to prune the traversal at the point a previous 
solution is found for a function in the same calling context. In Figure 5.1, 
asterisks indicate points at which a function is being examined after having 
already been examined previously. If the calling context is the same as the 
one used previously, the traversal can be pruned at this point reusing infor- 
mation recorded from the previous context. Depending on how much reuse 
occurs, this factor can greatly reduce the amount of time the compiler spends 
analyzing a program in comparison to a fully inlined approach. 
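
The pruning idea can be illustrated with a short Python sketch of a depth-first pre-order walk that records each (function, calling context) pair it has solved. The call graph, the `transfer` callback that yields the context seen by a callee, and the function name `analyze` are all hypothetical; the sketch shows only the traversal and reuse, not the analysis performed at each node.

```python
def analyze(call_graph, transfer, root, root_context):
    """Return the (function, context) pairs in the order first analyzed.

    call_graph -- dict: function -> list of callees
    transfer   -- transfer(caller, callee, context) -> callee context
    Contexts must be hashable (e.g. a frozenset of distributions).
    """
    solved = set()
    order = []

    def visit(func, context):
        key = (func, context)
        if key in solved:
            return  # same calling context seen before: prune here
        solved.add(key)
        order.append(key)  # pre-order: analyze caller before callees
        for callee in call_graph.get(func, ()):
            visit(callee, transfer(func, callee, context))

    visit(root, root_context)
    return order


call_graph = {"main": ["f", "g"], "f": ["h"], "g": ["h"], "h": []}
ctx0 = frozenset({("A", "BLOCK,*")})
ctx1 = frozenset({("A", "*,BLOCK")})

# Every call site passes the same context: h is analyzed only once.
same = analyze(call_graph, lambda caller, callee, ctx: ctx, "main", ctx0)

# g calls h with a different distribution: h needs a second clone.
def changed_in_g(caller, callee, ctx):
    return ctx1 if caller == "g" else ctx

cloned = analyze(call_graph, changed_in_g, "main", ctx0)
print(len(same), len(cloned))  # 4 5
```
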

Referring back to Figure 1.1 once again, the technique for performing
distribution synthesis will be described in Section 5.1 while redistribution
synthesis will be described in Section 5.2. The static distribution assignment
(SDA) technique will be described in Section 5.3, but as the HPF
redistribution directive conversion is very straightforward, it will not be
discussed further in this section. More detailed descriptions of these
techniques and the implementation of the framework can be found in [30].

5.1 Distribution Synthesis 

When analyzing HPF programs, it is necessary to first perform distribution 
synthesis in order to determine which distributions are present at every point 
in a program. Since HPF semantics specify that any redistribution (implicit or 
explicit) due to a function call is not visible to the caller, each function can be 
examined independently of the functions it calls. Only the input distributions 
for a given function and the explicit redistribution it performs have to be 
considered to obtain the reaching distributions within a function. 

Given an HPF program, nodes (or blocks) in its DFG are delimited by 
the redistribution operations which appear in the form of HPF REDISTRIBUTE 
or REALIGN directives. As shown in Figure 5.2, the redistribution operations 
assigned to a block B represent the redistribution that will be performed when 
entering the block on any input path (indicated by the set REDIST(B)) as
opposed to specifying the redistribution performed for each incoming path
(REDIST(B, B1) or REDIST(B, B2) in the figure).

Fig. 5.2. Distribution and redistribution synthesis

If the set GEN(B) is viewed as the distributions which are generated and
KILL(B) as the distributions which are killed upon entering the block, this
problem can now be cast directly into the reaching distribution data-flow
framework by making the following assignments:

Data-flow initialization:

    REDIST(B) = from directives      DIST(B) = $\emptyset$
    GEN(B)    = REDIST(B)            KILL(B) = REDIST$_{var}$(B)
    OUT(B)    = REDIST(B)            IN(B)   = $\emptyset$

Data-flow solution:

    DIST(B) = OUT(B)

According to the HPF standard, a REALIGN operation only affects the 
array being realigned while a REDISTRIBUTE operation should redistribute 
all arrays currently aligned to the given array being redistributed (in order 
to preserve any previously specified alignments). The current implementation
only records redistribution information for the array immediately involved 
in a REDISTRIBUTE operation. This results in only redistributing the array 
involved in the directive and not all of the alignees of the target to which 
it is aligned. In the future, the implementation could be easily extended to 
support the full HPF interpretation of REDISTRIBUTE by simply recording the 
same redistribution information for all alignees for the target template of the 
array involved in the operation. Due to the properties of REALIGN, this will 
also require first determining which templates arrays are aligned to at every 
point in the program (i.e., reaching alignments) using similar techniques. 
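
The distribution synthesis pass of this section can be sketched as a small forward data-flow solver. This is only an illustration under stated assumptions: the propagation equations used here (IN as the union of predecessor OUT sets, OUT = GEN ∪ (IN − KILL)) are the usual reaching-definitions form and are assumed rather than taken from this section, and the names `distribution_synthesis`, `preds`, `redist`, and `entry_dists` are hypothetical.

```python
def distribution_synthesis(preds, redist, entry_block, entry_dists):
    """Return DIST(B) for every block B as a set of (array, distribution).

    preds:       block -> list of predecessor blocks
    redist:      block -> {array: distribution} taken from the directives
    entry_dists: distributions reaching the function entry
    """
    blocks = list(preds)
    # Initialization: GEN(B) = OUT(B) = REDIST(B).
    out = {b: set(redist[b].items()) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            in_b = set(entry_dists) if b == entry_block else set()
            for p in preds[b]:
                in_b |= out[p]
            killed = set(redist[b])  # KILL(B): arrays redistributed in B
            new_out = set(redist[b].items()) | {
                (v, d) for (v, d) in in_b if v not in killed}
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return out  # solution: DIST(B) = OUT(B)


# A conditional that redistributes x on only one path (B1), joined at B3.
preds = {"B0": [], "B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
redist = {"B0": {"x": "(BLOCK, CYCLIC)"}, "B1": {"x": "(BLOCK, BLOCK)"},
          "B2": {}, "B3": {}}
dist = distribution_synthesis(
    preds, redist, "B0", {("x", "(BLOCK, BLOCK)")})
```

In this example two distributions of `x` reach the join block `B3`, which is the situation reported as multiple reaching distributions later in the chapter.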

5.2 Redistribution Synthesis 

After the distributions have been determined for each point in the program, 
the redistribution can be optimized. Instead of using either a simple copy- 
in/copy-out strategy or a complete redistribution of all arguments upon every 
entry and exit of a function, any implicit redistribution around function calls 
can be reduced to only that which is actually required to preserve HPF 
semantics. Redistribution operations (implicitly specified by HPF semantics 
or explicitly specified by a programmer) that result in distributions which 
would not otherwise be used before another redistribution operation occurs 
are completely removed in this pass. 

Blocks are now delimited by changes in the distribution set. The set of
reaching distributions previously computed for a block B represents the
distributions which are in effect when executing that block (indicated by the
set DIST(B) in Figure 5.2). For this reason, the DIST(B) sets are first
restricted to only the variables defined or used within block B. Redistribution
operations will now only be performed between two blocks if there is an
intervening definition or use of a variable before the next change in
distribution. Since we have also chosen to use a caller redistributes model,
the GEN(B) and KILL(B) sets are now viewed as the distributions which are
generated or killed upon leaving block B. Using these definitions, this problem
can now be cast directly into the reaching distribution data-flow framework by
making the following assignments:

Data-flow initialization:

    REDIST(B) = $\emptyset$
    DIST(B)   = DIST(B) $|_{(\mathrm{DEF}(B) \cup \mathrm{USE}(B))}$
    GEN(B)    = DIST(B)              KILL(B) = DIST$_{var}$(B)
    OUT(B)    = DIST(B)              IN(B)   = $\emptyset$

Data-flow solution:

    REDIST(B) = DIST$_{var}$(B) $|_{(\mathrm{IN}_{var}(B) \neq \mathrm{DIST}_{var}(B))}$

As will be seen later, using a caller redistributes model exposes many in- 
terprocedural optimization opportunities and also cleanly supports function 
calls which may require redistribution on both their entry and exit. Since 
REDIST(B) is determined from both the DIST and IN sets, DIST(B) represents
the distributions needed for executing block B, while the GEN and
KILL sets will be used to represent the exit distribution (which may or may 
not match DIST). 

By first restricting the DIST(B) sets to only those variables defined or
used within block B, redistribution is generated only where it is actually
needed: the locations in which a variable is actually used in a distribution
different from the current one (demand-driven, or lazy, redistribution).
Although it will not be examined here, it would also be possible to take a lazy
redistribution solution and determine the earliest possible time that the
redistribution could be performed (eager redistribution) in order to
redistribute an array when a distribution is no longer in use. The area between
the eager and lazy redistribution points forms a window over which the
operation can be performed to obtain the same effect. As will be shown later,
it would be advantageous to position multiple redistribution operations in
overlapping windows to the same point in the program in order to aggregate the
communication, thereby reducing the amount of communication overhead [30].
As the lazy redistribution point is found using a forward data-flow (reaching
distributions) problem, it would be possible to find the eager redistribution
point by performing some additional bookkeeping to record the last use of a
variable as the reaching distributions are propagated along the flow graph;
however, such a technique is not currently implemented in PARADIGM. In
comparison to other approaches, interval analysis has also been used to
determine eager/lazy points for code placement, but at the expense of a somewhat
more complex formulation [43]. 
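
The lazy (demand-driven) placement described above can be sketched in the same style. The sketch assumes that the DIST sets have already been computed, that propagation follows the usual forward form OUT = GEN ∪ (IN − KILL), and that a redistribution is needed in a block exactly when some incoming distribution of a used array differs from the one the block requires. The names `redistribution_synthesis`, `used_vars`, and the block labels are hypothetical.

```python
def redistribution_synthesis(preds, dist, used_vars, entry_block, entry_dists):
    """Return REDIST(B): the redistribution needed on entry to each block."""
    blocks = list(preds)
    # Restrict DIST(B) to the arrays defined or used in B.
    need = {b: {(v, d) for (v, d) in dist[b] if v in used_vars[b]}
            for b in blocks}
    out = {b: set(need[b]) for b in blocks}   # GEN(B) = OUT(B) = DIST(B)
    inn = {b: set() for b in blocks}          # IN(B) starts empty
    changed = True
    while changed:
        changed = False
        for b in blocks:
            in_b = set(entry_dists) if b == entry_block else set()
            for p in preds[b]:
                in_b |= out[p]
            killed = {v for (v, _) in need[b]}
            new_out = need[b] | {(v, d) for (v, d) in in_b if v not in killed}
            if in_b != inn[b] or new_out != out[b]:
                inn[b], out[b] = in_b, new_out
                changed = True
    # Redistribute only where an incoming distribution differs from DIST(B).
    return {b: {(v, d) for (v, d) in need[b]
                if any(v2 == v and d2 != d for (v2, d2) in inn[b])}
            for b in blocks}


# x is redistributed in B1 but not referenced there, then used in B2 and B3.
preds = {"B0": [], "B1": ["B0"], "B2": ["B1"], "B3": ["B2"]}
dist = {"B0": {("x", "BB")}, "B1": {("x", "BC")},
        "B2": {("x", "CC")}, "B3": {("x", "CC")}}
used_vars = {"B0": {"x"}, "B1": set(), "B2": {"x"}, "B3": {"x"}}
redist = redistribution_synthesis(preds, dist, used_vars, "B0", {("x", "BB")})
```

The distribution introduced in `B1` is never used, so no redistribution is generated for it; a single redistribution appears in `B2`, where `x` is first needed in a distribution different from the one reaching it.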

5.2.1 Optimizing Invariant Distributions. Besides performing redistribution
only when necessary, it is also desirable to perform necessary
redistribution as infrequently as possible.

Semantically loop invariant distribution regions⁵ can be first grown before
synthesizing the redistribution operations. All distributions that do not
change within a nested statement (e.g., a loop or if structure) are recorded
on the parent statement (or header) for that structure. This has the effect
of moving redistribution operations which result in the invariant distribution
out of nested structures as far as possible (as was also possible in [18]).

As a side effect, loops which are considered to contain an invariant dis- 
tribution no longer propagate previous distributions for the invariant arrays. 
Since redistribution is moved out of the loop, this means that for the (ex- 
tremely rare) special case of a loop invariant distribution (which was not 
originally present outside of the loop) contained within an undetectable zero 
trip loop, only the invariant distribution from within the loop body is propa- 
gated even though the loop nest was never executed. As this is only due to the 
way invariant distributions are handled, the data-flow handles non-invariant 
distributions as expected for zero trip loops (an extra redistribution check 
may be generated after the loop execution). 

5.2.2 Multiple Active Distributions. Even though it is not specifically 
stated as such in the HPF standard, we will consider an HPF program in 
which every use or definition of an array has only one active distribution 
to be well-behaved. Since PARADIGM cannot currently compile programs 
which contain references with multiple active distributions, this property is 
currently detected by examining the reaching distribution sets for every node 
(limited by DEF/USE) within a function. A warning is issued if any set
contains multiple distributions for a given use or definition of a variable, stating
that the program is not well-behaved. 
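
The well-behavedness check described above amounts to counting, for every block, the distinct reaching distributions of each array that the block defines or uses. A minimal sketch follows, assuming the reaching distribution sets are available as Python sets; the function name and data layout are hypothetical, and the message text simply mirrors the warning comments shown in Figure 6.1(b).

```python
from collections import defaultdict


def multiple_distribution_warnings(dist, used_vars):
    """List a warning for every block that references an array having more
    than one reaching distribution (the program is then not well-behaved)."""
    warnings = []
    for block in sorted(dist):
        by_var = defaultdict(set)
        for var, d in dist[block]:
            if var in used_vars[block]:      # limit the check by DEF/USE
                by_var[var].add(d)
        for var in sorted(by_var):
            count = len(by_var[var])
            if count > 1:
                warnings.append(
                    f"{block}: too many dists ({count}) for {var}")
    return warnings


dist = {"B1": {("x", "(BLOCK, BLOCK)")},
        "B3": {("x", "(BLOCK, BLOCK)"), ("x", "(BLOCK, CYCLIC)")}}
used_vars = {"B1": {"x"}, "B3": {"x"}}
warnings = multiple_distribution_warnings(dist, used_vars)
```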

In the presence of function calls, it is also possible to access an array 
through two or more paths when parameter aliasing is present. If there is an 
attempt to redistribute only one of the aliased symbols, the different aliases 
now have different distributions even though they actually refer to the same 
array. This form of multiple active distributions is actually considered to be 
non-conforming in HPF [26] as it can result in consistency problems if the 
same array were allowed to occupy two different distributions. As it may be 
difficult for programmers to make this determination, this can be
automatically detected by determining if the reaching distribution set contains
different distributions for any aliased arrays.⁶



⁵ Invariant redistribution within a loop can technically become non-invariant
when return distributions from a function call within a loop nest are allowed
to temporarily exist in the caller’s scope. Such regions can still be treated as
invariant since this is the view HPF semantics provide to the programmer.

⁶ The param_alias pass in Parafrase-2 [34], which PARADIGM is built upon, is
first run to compute the alias sets for every function call.

5.3 Static Distribution Assignment (SDA)

To utilize the available memory on a given parallel machine as efficiently 
as possible, only the distributions that are active at any given point in the 
program should actually be allocated space. It is interesting to note that as 
long as a given array is distributed among the same total number of proces- 
sors, the actual space required to store one section of the partitioned array is 
the same no matter how many array dimensions are distributed.⁷ By using
this observation, it is possible to statically allocate the minimum amount of 
memory by associating all possible distributions of a given array to the same 
area of memory. 
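
The observation about per-processor storage can be checked with a little arithmetic. The sketch below assumes block or cyclic (block size one) partitioning, for which the largest local section along a dimension holds ceil(extent / processors) elements; the function names are hypothetical.

```python
from math import ceil


def local_elements(shape, mesh):
    """Upper bound on the elements one processor stores when `shape` is
    partitioned over a processor mesh with the given extents."""
    count = 1
    for extent, procs in zip(shape, mesh):
        count *= ceil(extent / procs)
    return count


def shared_allocation(shape, meshes):
    """Local storage able to hold any of the candidate distributions."""
    return max(local_elements(shape, mesh) for mesh in meshes)


# 512 x 512 over 32 processors: the local section has the same size
# whether one or two dimensions are distributed.
even = [local_elements((512, 512), m) for m in [(32, 1), (1, 32), (8, 4)]]

# 257 x 257 does not divide evenly, so the sizes differ slightly and the
# shared area must hold the largest one.
uneven = shared_allocation((257, 257), [(32, 1), (1, 32), (8, 4)])
```

For the evenly divisible case every candidate needs 8192 elements per processor; for the 257 × 257 case the shared area needs 2313 elements, compared with 2145 for the (8, 4) mesh alone.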

Static Distribution Assignment (SDA) (inspired indirectly by the Static 
Single Assignment (SSA) form [10]) is a process we have developed in which 
the names of array variables are duplicated and renamed statically based 
on the active distributions represented in the corresponding DIST sets. As 
names are generated, they are assigned a static distribution corresponding 
to the currently active dynamic distribution for the original array. The new 
names will not change distribution during the course of the program. Redis- 
tribution now takes the form of an assignment between two different statically 
distributed source and destination arrays (as opposed to rearranging the data 
within a single array). 

To statically achieve the minimum amount of memory allocation required, 
all of the renamed duplicates of a given array are declared to be “equivalent.” 
The EQUIVALENCE statement in Fortran 77 allows this to be performed at the 
source level in a manner somewhat similar to assigning two array pointers to 
the same allocated memory as is possible in C or Fortran 90. Redistribution 
directives are also now replaced with actual calls to a redistribution library. 

Because the different static names for an array share the same memory,
the communication operations used to implement the redistribution
should read all of the source data before writing to the target. In
the worst case, an entire copy of a partitioned array would have to be buffered 
at the destination processor before it is actually received and moved into the 
destination array. However, as soon as more than two different distributions 
are present for a given array, the EQUIVALENCE begins to pay off, even in the 
worst case, in comparison to separately allocating each different distribution. 
If the performance of buffered communication is insufficient for a given ma- 
chine (due to the extra buffer copy), non-buffered communication could be 
used instead thereby precluding the use of EQUIVALENCE (unless some form 
of explicit buffering is performed by the redistribution library itself). 
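
The break-even point mentioned above follows from a simple count, sketched here in units of one local array section. The worst-case figure of one full receive buffer is taken from the text; the function itself is only an illustration and its name is hypothetical.

```python
def allocation_units(num_distributions: int, use_equivalence: bool) -> int:
    """Per-processor memory, in units of one local array section."""
    if num_distributions <= 1:
        return num_distributions      # static array: nothing to share
    if use_equivalence:
        return 2                      # shared section + worst-case buffer
    return num_distributions          # one section per distribution


savings = {n: allocation_units(n, False) - allocation_units(n, True)
           for n in (2, 3, 4)}
```

With two distributions the two strategies need the same amount of memory in the worst case; from three distributions onward the shared allocation is smaller.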



⁷ Taking into account distributions in which the number of processors allocated
to a given array dimension does not evenly divide the size of the dimension,
or degenerate distributions in which memory is not evenly distributed over all
processors, it can also be equivalently said that there is an amount of memory
which can store all possible distributions with very little excess.

(a) Before SDA

      REAL A(N, N)
!HPF$ DISTRIBUTE (CYCLIC, *) :: A
      A(i, j) = ...
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: A
      ... = A(i, j)

(b) After SDA

      REAL A$0(N, N), A$1(N, N)
!HPF$ DISTRIBUTE (CYCLIC, *) :: A$0
!HPF$ DISTRIBUTE (BLOCK, BLOCK) :: A$1
      EQUIVALENCE (A$0, A$1)
      INTEGER A$cid
      A$cid = 0
      A$0(i, j) = ...
      CALL reconfig(A$1, 1, A$cid)
      ... = A$1(i, j)

Fig. 5.3. Example of static distribution assignment

In Figure 5.3, a small example is shown to illustrate this technique. In 
this example, a redistribution operation on A causes it to be referenced using 
two different distributions. A separate name is statically generated for each 
distribution of A, and the redistribution directive is replaced with a call to a 
run-time redistribution library [36]. The array accesses in the program can 
now be compiled by PARADIGM using techniques developed for programs 
which only contain static distributions [4] by simply ignoring the communi- 
cation side effects of the redistribution call. 

As stated previously, if more than one distribution is active for any given
array reference, the program is considered to be not well-behaved, and the
array involved cannot be directly assigned a static distribution. In certain
circumstances, however, it may be possible to perform code transformations
to make an HPF program well-behaved. For instance, consider a loop whose body
is reached by multiple active distributions only because redistribution within
the loop produces a distribution on the loop back edge that was not present on
the loop entry. Such a loop would not be well-behaved. If the first
iteration of that loop were peeled off, the entire loop body would now have a
single active distribution for each variable and the initial redistribution into
this state would be performed outside of the loop. This and other code
transformations which help reduce the number of distributions reaching any given
node will be the focus of further work in this area.



6. Results 

To evaluate the quality of the data distributions selected using the techniques 
presented in this chapter, as implemented in the PARADIGM compiler, we 
analyze three programs which exhibit different access patterns over the course 
of their execution. These programs are individual Fortran 77 subroutines 
which range in size from roughly 60 to 150 lines of code:

=> Synthetic HPF redistribution example
=> 2-D Alternating Direction Implicit iterative method (ADI2D) [14]
=> Shallow water weather prediction benchmark [37]



6.1 Synthetic HPF Redistribution Example 

In Figure 6.1(a), a synthetic HPF program is presented which performs a
number of different tests (described in the comments appearing in the code)
of the optimizations performed by the framework. In this program, one array,
x, is redistributed both explicitly using HPF directives and implicitly through
function calls using several different interfaces. Two of the functions, func1
and func2, have prescriptive interfaces [26] which may or may not require
redistribution (depending on the current configuration of the input array).
The first function differs from the second in that it also redistributes the
array such that it returns with a distribution different from the one with
which it was called. The last function, func3, differs from the first two in
that it has an (implicit) transcriptive interface [26]. Calls to this function
will cause it to inherit the current distribution of the actual parameters.

Several things can be noted when examining the optimized HPF shown in
Figure 6.1(b).⁸ First of all, the necessary redistribution operations required to
perform the implicit redistribution at the function call boundaries have been
made explicit in the program. Here, the interprocedural analysis has
completely removed any redundant redistribution by relaxing the HPF semantics,
allowing distributions caused by function side effects to exist so long as they
do not affect the original meaning of the program. For the transcriptive
function, func3, the framework also generated two separate clones, func3$0 and
func3$1, corresponding to two different active distributions appearing in a
total of three different calling contexts.

Two warnings were also generated by the compiler, inserted by hand as
comments in Figure 6.1(b), indicating that there were (semantically) multiple
reaching distributions for two references of x in the program. The first
reference actually does have two reaching distributions due to a conditional
with redistribution performed on only one path. The second, however, occurs
after a call to a prescriptive function, func1, which implicitly redistributes
the array to conform to its interface. Even though the redistribution for this
function will accept either of the two input distributions and generate only a
single distribution of x for the function, the following reference of x
semantically still has two reaching distributions, hence the second warning.

The optimization of loop invariant redistribution operations can also be
seen in the first loop nest of this example in which a redistribution
operation on x is performed at the deepest level of a nested loop. If there are no
references of x before the occurrence of the redistribution (and no further
redistribution performed in the remainder of the loop), then x will always
have a (BLOCK, CYCLIC) distribution within the loop body. This situation is
detected by the framework and the redistribution operation is re-synthesized
to occur outside of the entire loop nest. It could be argued that even when
it appeared within the loop, the underlying redistribution library could be
written to be smart enough to only perform the redistribution when it is
necessary (i.e., only on the first iteration) so that we have not really optimized
away 10² redistribution operations. Even in this case, this optimization has
still completely eliminated (10² − 1) check operations that would have been
performed at run time to determine if the redistribution was required.

⁸ The HPF output, generated by PARADIGM, has been slightly simplified by
removing unnecessary alignment directives from the figure to improve its clarity.

474 Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee

(a) Before optimization

      PROGRAM test
      INTEGER x(10,10)
c *** For tests involving statement padding
      INTEGER a
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, BLOCK) :: x
c *** Use of initial distribution
      x(1,1) = 1
c *** Testing loop invariant redistribution
      DO i = 1,10
        DO j = 1,10
          a = 0
!HPF$     REDISTRIBUTE (BLOCK, CYCLIC) :: x
          x(i,j) = 1
          a = 0
        ENDDO
      ENDDO
      a = 0
c *** Testing unnecessary redistribution
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      if (x(i,j) .gt. 1) then
c *** Testing redistribution in a conditional
!HPF$   REDISTRIBUTE (BLOCK, BLOCK) :: x
        x(i,j) = 2
        call func3(x,n)
      else
        x(i,j) = 3
      endif
c *** Uses with multiple reaching distributions
      x(1,1) = 2
      call func1(x,n)
      DO i = 1,10
        DO j = 1,10
          x(j,i) = 2
        ENDDO
      ENDDO
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) :: x
c *** Testing chaining of function arguments
      call func1(x,n)
      call func2(x,n)
      call func1(x,n)
c *** Testing loop invariant due to return
      DO i = 1,10
        DO j = 1,10
c *** Testing transcriptive function cloning
          call func3(x,n)
        ENDDO
      ENDDO
      a = 1
c *** Testing unused distribution
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      a = 0
c *** Testing "semantically killed" distribution
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) :: x
      call func3(x,n)
      END

      integer function func1(a,n)
c *** Prescriptive function with different return
      integer n, a(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, CYCLIC) :: a
      a(1,1) = 1
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: a
      a(1,2) = 1
!HPF$ REDISTRIBUTE (CYCLIC, CYCLIC) :: a
      a(1,3) = 1
      end

      integer function func2(y,n)
c *** Prescriptive function with identical return
      integer n, y(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (CYCLIC, CYCLIC) :: y
      y(1,1) = 2
      end

      integer function func3(z,n)
c *** (implicitly) Transcriptive function
      integer n, z(n,n)
      z(1,1) = 3
      end

(b) After optimization

      PROGRAM test
      INTEGER x(10,10)
      INTEGER a, n
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, BLOCK) :: x
      x(1,1) = 1
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      DO i = 1,10
        DO j = 1,10
          a = 0
          x(i,j) = 1
          a = 0
        END DO
      END DO
      a = 0
      IF (x(i,j) .GT. 1) THEN
!HPF$   REDISTRIBUTE (BLOCK, BLOCK) :: x
        x(i,j) = 2
        CALL func3$0(x,n)
      ELSE
        x(i,j) = 3
      END IF
c *** WARNING: too many dists (2) for x
      x(1,1) = 2
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      CALL func1(x,n)
      DO i = 1,10
        DO j = 1,10
c *** WARNING: too many dists (2) for x
          x(j,i) = 2
        END DO
      END DO
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      CALL func1(x,n)
      CALL func2(x,n)
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
      CALL func1(x,n)
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) :: x
      DO i = 1,10
        DO j = 1,10
          CALL func3$1(x,n)
        END DO
      END DO
      a = 1
      a = 0
      CALL func3$1(x,n)
      END

      INTEGER FUNCTION func1(a,n)
      INTEGER n, a(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, CYCLIC) :: a
      a(1,1) = 1
      a(1,2) = 1
!HPF$ REDISTRIBUTE (CYCLIC, CYCLIC) :: a
      a(1,3) = 1
      END

      INTEGER FUNCTION func2(y,n)
      INTEGER n, y(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (CYCLIC, CYCLIC) :: y
      y(1,1) = 2
      END

      INTEGER FUNCTION func3$1(n,z)
      INTEGER n, z(n,n)
!HPF$ DISTRIBUTE (CYCLIC(3), CYCLIC) :: z
      z(1,1) = 3
      END

      INTEGER FUNCTION func3$0(n,z)
      INTEGER n, z(n,n)
!HPF$ DISTRIBUTE (BLOCK, BLOCK) :: z
      z(1,1) = 3
      END

Fig. 6.1. Synthetic example for interprocedural redistribution optimization

As there are several other optimizations performed on this example, which 
we will not describe in more detail here, the reader is directed to the comments 
in the code for further information. 

6.2 2-D Alternating Direction Implicit (ADI2D) Iterative Method 

In order to evaluate the effectiveness of dynamic distributions, the ADI2D
program, with a problem size of 512 × 512,⁹ is compiled with a fully static
distribution (one iteration shown in Figure 6.2(a)) as well as with the selected
dynamic distribution¹⁰ (one iteration shown in Figure 6.2(b)). These two
parallel versions of the code were run on an Intel Paragon and a Thinking
Machines CM-5 to examine their performance on different architectures.




[Figure panels: (a) Static (pipelined), (b) Dynamic (redistribution); axis: time]

Fig. 6.2. Modes of parallel execution for ADI2D

The static scheme illustrated in Figure 6.2(a) performs a shift operation
to initially obtain some required data and then satisfies two recurrences in
the program using software pipelining [19,33]. Since values are being
propagated through the array during the pipelined computation, processors must
wait for results to be computed before continuing with their own part of the
computation. Depending on the performance ratio of communication vs.
computation for a given machine, the amount of computation performed before
communicating to the next processor in the pipeline will have a direct effect
on the overall performance of a pipelined computation [20,33].

⁹ To prevent poor serial performance from cache-line aliasing due to the power
of two problem size, the arrays were also padded with an extra element at the
end of each column. This optimization, although here performed by hand, is
also automated by aggressive serial optimizing compilers.

¹⁰ In the current implementation, loop peeling is not performed on the
generated code. As previously mentioned in Section 3.4, the single additional startup
redistribution due to not peeling will not be significant in comparison to the
execution of the loop (containing a dynamic count of 600 redistributions).

A small experiment is first performed to determine the best pipeline gran- 
ularity for the static partitioning. A granularity of one (fine-grain) causes 
values to be communicated to waiting processors as soon as they are pro- 
duced. By increasing the granularity, more values are computed before com- 
municating, thereby amortizing the cost of establishing communication in 
exchange for some reduction in parallelism. In addition to the experimental 
data, compile-time estimates of the pipeline execution [33] are shown in Fig- 
ure 6.3. For the two machines, it can be seen that by selecting the appropriate 
granularity, the performance of the static partitioning can be improved. Both 
a fine-grain and the optimal coarse-grain static partitioning will be compared 
with the dynamic partitioning. 

The redistribution present in the dynamic scheme appears as three
transposes¹¹ performed at two points within an outer loop (the exact points in the
program can be seen in Figure 3.7). Since the sets of transposes occur at the
same point in the program, the data to be communicated for each transpose
can be aggregated into a single message during the actual transpose. As it
has been previously observed that aggregating communication improves
performance by reducing the overhead of communication [20,33], we will also
examine aggregating the individual transpose operations here.





(a) Intel Paragon (b) TMC CM-5 

Fig. 6.3. Coarse-grain pipelining for ADI2D 



¹¹ This could have been reduced to two transposes at each point if we allowed
the cuts to reorder statements and perform loop distribution on the innermost
loops (between statements 17, 18 and 41, 42), as mentioned in Section 3, but
these optimizations are not examined here.

In Figure 6.4, the performance of both dynamic and static partitionings
for ADI2D is shown for an Intel Paragon and a Thinking Machines CM-5.
For the dynamic partitioning, both aggregated and non-aggregated transpose
operations were compared. For both machines, it is apparent that aggregating
the transpose communication is very effective, especially as the program is
executed on larger numbers of processors. This can be attributed to the
fact that the start-up cost of communication (which can be several orders of
magnitude greater than the per byte transmission cost) is being amortized
over multiple messages with the same source and destination. 
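
The amortization argument can be made concrete with the common linear message cost model, in which sending n bytes costs alpha + beta * n. This model and the parameter names `alpha` (start-up cost) and `beta` (per-byte cost) are assumptions introduced here for illustration; they are not taken from the measurements in this chapter.

```python
def individual_cost(k, nbytes, alpha, beta):
    """Cost of k separate messages of nbytes each (one start-up apiece)."""
    return k * (alpha + beta * nbytes)


def aggregated_cost(k, nbytes, alpha, beta):
    """Cost of one message carrying the data of all k transposes."""
    return alpha + beta * k * nbytes


# Hypothetical parameters, chosen only to show the shape of the saving.
k, nbytes, alpha, beta = 3, 1000, 100.0, 0.01
saving = individual_cost(k, nbytes, alpha, beta) - aggregated_cost(
    k, nbytes, alpha, beta)
```

Under this model the saving is always (k − 1) start-up costs per source and destination pair, independent of the message size, which is why aggregation matters most when the per-processor data volume is small.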

For the static partitioning, the performance of the fine-grain pipeline was 
compared to a coarse-grain pipeline using the optimal granularity. The coarse- 
grain optimization yielded the greatest benefit on the CM-5 while still im- 
proving the performance (to a lesser degree) on the Paragon. For the Paragon, 
the dynamic partitioning with aggregation clearly improved performance (by 
over 70% compared to the fine-grain and 60% compared to the coarse-grain 
static distribution). On the CM-5, the dynamic partitioning with aggregation 
showed performance gains of over a factor of two compared to the fine-grain 
static partitioning but only outperformed the coarse-grain version for ex- 
tremely large numbers of processors. For this reason, it would appear that 
the limiting factor on the CM-5 is the performance of the communication. 

As previously mentioned in Section 3.4, the static partitioner currently 
makes a very conservative estimate for the execution cost of pipelined 
loops [15]. For this reason a dynamic partitioning was selected for both the 
Paragon as well as the CM-5. If a more accurate pipelined cost model [33] 
were used, a static partitioning would have been selected instead for the 
CM-5. For the Paragon, the cost of redistribution is still low enough that a 
dynamic partitioning would still be selected for large machine configurations. 

[Figure panels: (a) Intel Paragon, (b) TMC CM-5; horizontal axis: Processors]

Fig. 6.4. Performance of ADI2D

It is also interesting to estimate the cost of performing a single transpose
in either direction (P × 1 ↔ 1 × P) from the communication overhead present
in the dynamic runs. Ignoring any performance gains from cache effects, the
communication overhead can be computed by subtracting the ideal run time
(serial time divided by the selected number of processors) from the measured
run time. Given that three arrays are transposed 200 times, the resulting
overhead divided by 600 yields a rough estimate of how much time is required
to redistribute a single array as shown in Table 6.1. 
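
The estimate described above is a one-line calculation, sketched here for clarity. The function name is hypothetical, and the timings in the example call are invented placeholders rather than measurements from these experiments.

```python
def transpose_time_ms(measured_s, serial_s, processors,
                      arrays=3, iterations=200):
    """Rough time (ms) to redistribute one array, from a dynamic run."""
    ideal_s = serial_s / processors      # ideal parallel run time
    overhead_s = measured_s - ideal_s    # attributed to communication
    return 1000.0 * overhead_s / (arrays * iterations)


# Placeholder timings: 60 s serial, 6.75 s measured on 16 processors.
estimate = transpose_time_ms(measured_s=6.75, serial_s=60.0, processors=16)
```

With these placeholder values the ideal run time is 3.75 s, the overhead is 3.0 s, and the estimate is 5 ms per transpose.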

From Table 6.1 it can be seen that as more processors are involved in the
operation, the time taken to perform one transpose decreases and levels off
until a certain number of processors is reached. After this point, the amount of data being
handled by each individual processor is small enough that the startup over- 
head of the communication has become the controlling factor. Aggregating 
the redistribution operations minimizes this effect thereby achieving higher 
levels of performance than would be possible otherwise. 

Table 6.1. Empirically estimated time (ms) to transpose a 1-D partitioned
matrix (512 × 512 elements; double precision)

                  Intel Paragon               TMC CM-5
  processors   individual  aggregated   individual  aggregated
       8          36.7        32.0         138.9       134.7
      16          15.7        15.6          86.8        80.5
      32          14.8        10.5          49.6        45.8
      64          12.7         6.2          40.4        29.7
     128          21.6         8.7          47.5        27.4




6.3 Shallow Water Weather Prediction Benchmark 

Since not all programs will necessarily need dynamic distributions, we also 
examine another program which exhibits several different smaller phases of 
computation. The Shallow water benchmark is a weather prediction program 
using finite difference models of the shallow water equations [37] written by 
Paul Swarztrauber from the National Center for Atmospheric Research. 

As the program consists of a number of different functions, the program is 
first inlined since the approach for selecting data distributions is not yet fully 
interprocedural. Also, a loop which is implicitly formed by a GOTO statement 
is replaced with an explicit loop since the current performance estimation 
framework does not handle unstructured code. The final input program, ig- 
noring comments and declarations, resulted in 143 lines of executable code. 

In Figure 6.5, the phase transition graph is shown with the selected path
using costs based on a 32 processor Intel Paragon with the original problem
size of 257 × 257 limited to 100 iterations. The decomposition resulting in
this graph was purposely bounded only by the productivity of the cut, and
not by a minimum cost of redistribution, in order to expose all potentially
beneficial phases. This graph shows that by using the decomposition technique
presented in Figure 3.8, Shallow contains six phases (the length of the
path between the start and stop node) with a maximum of four (sometimes
redundant) candidates for any given phase.

[Figure: Phase decomposition for Shallow]

Fig. 6.5. Phase transition graph and solution for Shallow
(Displayed with the selected phase transition path and cumulative cost.)



As there were no alignment conflicts in the program, and only BLOCK
distributions were necessary to maintain a balanced load, the distribution of
the 14 arrays in the program can be inferred from the selected configuration of
the processor mesh. By tracing the path back from the stop node to the start,
the figure shows that the dynamic partitioner selected a two-dimensional
(8 × 4) static data distribution. Since there is no redistribution performed
along this path, the loop peeling process previously described in Section 3.4
is not shown on this graph as it is only necessary when there is actually
redistribution present within a loop.

As the communication and computation estimates are best-case approx- 
imations (they don’t take into account communication buffering operations 
or effects of the memory hierarchy), it is safe to say that for the Paragon, a 
dynamic data distribution does not exist which can out-perform the selected 
static distribution. Theoretically, if the communication cost for a machine 
were insignificant in comparison to the performance of computation, redis- 
tributing data between the phases revealed in the decomposition of Shallow 
would be beneficial. In this case, the dynamic partitioner performed more 
work to come to the same conclusion that a single application of the static 
partitioner would have. It is interesting to note that even though the dy- 
namic partitioner considers any possible redistribution, it will still select a 
static distribution if that is what is predicted to have the best performance.

Fig. 6.6. Performance of Shallow

In Figure 6.6, the performance of the selected static 2-D distribution
(BLOCK, BLOCK) is compared to a static 1-D row-wise distribution
(BLOCK, *) which appeared in some of the subphases. The 2-D distribution matches
the performance of the 1-D distribution for small numbers of processors while 
outperforming the 1-D distribution for both the Paragon and the CM-5 (by 
up to a factor of 1.6 or 2.7, respectively) for larger numbers of processors. 

Kennedy and Kremer [24] have also examined the Shallow benchmark, 
but predicted that a one-dimensional (column-wise) block distribution was 
the best distribution for up to 32 processors of an Intel iPSC/860 hypercube 
(while also showing that the performance of a 1-D column-wise distribu- 
tion was almost identical to a 1-D row-wise distribution). They only con- 
sidered one-dimensional candidate layouts for the operational phases (since 
the Fortran D prototype compiler can not compile multidimensional distribu- 
tions [41]). As their operational definition already results in 28 phases (over 
four times as many in comparison to our approach), the complexity of the 
resulting 0-1 integer programming formulation will also only increase further 
when considering multidimensional layouts. 

7. Conclusions 

Dynamic data partitionings can provide higher performance than purely 
static distributions for programs containing competing data access patterns. 
The distribution selection technique presented in this chapter provides a 
means of automatically determining high quality data distributions (dynamic 
as well as static) in an efficient manner taking into account both the struc- 
ture of the input program as well as the architectural parameters of the 
target machine. Heuristics, based on the observation that high communica- 
tion costs are a result of not being able to statically align every reference 
in complex programs simultaneously, are used to form the communication 
graph. By removing constraints between competing sections of the program,
better distributions can potentially be obtained for the individual sections.
If the resulting gains in performance are high enough in comparison to the 
cost of redistribution, dynamic distributions are formed. Communication still 
occurs, but data movement is now isolated into dynamic reorganizations of 
ownership as opposed to constantly obtaining any required remote data based 
on a (compromise) static assignment of ownership. 

A key requirement in automating this process is to be able to obtain 
estimates of communication and computation costs which accurately model 
the behavior of the program under a given distribution. Furthermore, by 
building upon existing static partitioning techniques, the phases examined as 
well as the redistribution considered are focused in the areas of a program 
which will otherwise generate large amounts of communication. 

In this chapter we have also presented an interprocedural data-flow tech- 
nique that can be used to convert between redistribution and distribution 
representations optimizing redistribution while maintaining the semantics of 
the original program. For the data partitioner, the framework is used to syn- 
thesize explicit redistribution operations in order to preserve the meaning of 
what the data partitioner intended in the presence of HPF semantics. For 
HPF programs, redistribution operations (implicitly specified by HPF se- 
mantics or explicitly specified by a programmer) that result in distributions 
which would not otherwise be used before another redistribution operation 
occurs are completely removed. 

Many HPF compilers that are currently available as commercial products 
or those that have been developed as research prototypes do not yet support 
transcriptive argument passing or the REDISTRIBUTE and REALIGN directives 
as there is still much work required to provide efficient support for the HPF 
subset (which does not include these features). Since the techniques presented
in this chapter can convert all of these features into constructs which are in 
the HPF subset (through the use of SDA), this framework can also be used 
to provide these features to an existing subset HPF compiler.

1. Introduction

Distributed memory architectures have become popular as a viable and cost- 
effective method of building scalable parallel computers. However, the ab- 
sence of global address space, and consequently, the need for explicit message 
passing among processes makes these machines very difficult to program. This 
has motivated the design of languages like High Performance Fortran [14], 
which allow the programmer to write sequential or shared-memory parallel 
programs that are annotated with directives specifying data decomposition. 
The compilers for these languages are responsible for partitioning the compu- 
tation, and generating the communication necessary to fetch values of non- 
local data referenced by a processor. A number of such prototype compilers 
have been developed [3, 6, 19, 23, 29, 30, 33, 34, 43].

Accessing remote data is usually orders of magnitude slower than accessing
local data. This gap is growing because CPU performance is out-growing
network performance, CPUs are running relatively independent multiprogrammed
operating systems, and commodity networks are being found more
cost-effective. As a result, communication startup overheads tend to be astro- 
nomical on most distributed memory machines, although reasonable band- 
width can be supported for sufficiently large messages [36,37]. Thus compil- 
ers must reduce the number as well as the volume of messages in order to 
deliver high performance. The most common optimizations include message 
vectorization [23,43], using collective communication [18,30], and overlapping 
communication with computation [23]. However, many compilers perform lit- 
tle global analysis of the communication requirements across different loop 
nests. This precludes general optimizations, such as redundant communica- 
tion elimination, or carrying out extra communication inside one loop nest if 
it subsumes communication required in the next loop nest. 

This chapter presents a framework, based on global array data-flow anal- 
ysis, to reduce communication in a program. We apply techniques for par- 
tial redundancy elimination, discussed in the context of eliminating redun- 
dant computation by Morel and Renvoise [31], and later refined by other 
researchers [12,13,25]. The conventional approach to data-flow analysis re- 
gards each access to an array element as an access to the entire array. Previous 
researchers [16,17,35] have applied data-flow analysis to array sections to im- 
prove its precision. However, using just array sections is insufficient in the 
context of communication optimizations. There is a need to represent information
about the processors where the array elements are available, or need
to be made available. For this purpose, we introduce a new kind of descriptor, 
the Available Section Descriptor (ASD) [21]. The explicit representation of 
availability of data in our framework allows us to relax the restriction that 
only the owner of a data item be able to supply its value when needed by 
another processor. An important special case occurs when a processor that 
needs a value for its computation does not own the data but has a valid value 
available from prior communication. In that case, the communication from 
the owner to this processor can be identified as redundant and eliminated, 
with the intended receiver simply using the locally available value of data. 
We show how the data flow procedure for eliminating partial redundancies is 
extended and applied to communication, represented using the ASDs. With 
the resultant framework, we are able to capture a number of optimizations, 
such as: 

— vectorizing communication, 

— eliminating communication that is redundant in any control flow path, 

— reducing the amount of data being communicated, 

— reducing the number of processors to which data must be communicated, 
and 

— moving communication earlier to hide latency, and to subsume previous 
communication. 

We do not know of any other system that tries to perform all of these opti- 
mizations, and in a global manner. Following the results presented in [13] for 
partially redundant computations, we show that the bidirectional problem of 
eliminating partial redundancies can be decomposed into simpler unidirec- 
tional problems, in the context of communication represented using ASDs as 
well. That makes the analysis procedure more efficient. We have implemented 
a simplified version of this framework as part of a prototype HPF compiler. 
Our preliminary experiments show significant performance improvements re- 
sulting from this analysis. 

An advantage of our approach is that the analysis is performed on the 
original program form, before any communication is introduced by the com- 
piler. Thus, communication optimizations based on data availability analysis 
need not depend on a detailed knowledge of explicit communication repre- 
sentations. 

While our work has been done in the context of compiling for distributed 
memory machines, it is relevant for shared memory machines as well. Shared 
memory compilers can exploit information about interprocessor sharing of 
data to eliminate unnecessary barrier synchronization or replace barrier syn- 
chronization by cheaper producer-consumer synchronization in the generated 
parallel code [20,32,39]. A reduction in the number of communication mes- 
sages directly translates into fewer synchronization messages. Another appli- 
cation of our work is in improving the effectiveness of block data transfer 
operations in scalable shared memory multiprocessors. For scalable shared
memory machines with high network latency, it is important for the underly- 
ing system to reduce the overhead of messages needed to keep data coherent. 
Using block data transfer operations on these machines helps amortize the 
overhead of messages needed for non-local data access. The analysis presented 
in this paper can be used to accurately identify sections of data which are 
not available locally, and which should be subjected to block data transfer 
operations for better performance. 

The rest of this chapter is organized as follows: Section 2 describes, using 
an example, the various communication optimizations that are performed by 
the data flow procedure described in this chapter. Section 3 describes our rep- 
resentation of the Available Section Descriptor and the procedure to compute 
the communication generated for a statement in the program. Section 4 dis- 
cusses the procedure for computing data flow information used in optimizing 
communications. In Section 5, we describe how the different communication 
optimizations are captured by the data flow analysis. Section 6 describes an 
extension to our framework to select a placement of communication that can 
reduce communication costs of a program even further. Section 7 presents the 
algorithms for the various operations on ASDs. Section 8 presents some pre- 
liminary results and in Section 9, we discuss related work. Finally, Section 10 
presents conclusions. 



2. Motivating Example 

We now illustrate various communication optimizations mentioned above us- 
ing an example. Figure 2.1(a) shows an HPF program and a high level view of 
communication that would be generated by a compiler following the owner- 
computes rule [23,43], which assigns each computation to the processor that 
owns the data being computed. Communication is generated for each non- 
local data value used by a processor in the computation assigned to it. The 
HPF directives specify the alignment of each array with respect to a tem- 
plate VPROCS, which is viewed as a grid of virtual processors in the context 
of this work. The variables a and z are two-dimensional arrays aligned with 
VPROCS, and d, e, and w are one-dimensional arrays aligned with the first 
column of VPROCS. In this example, we assume that the scalar variable s is 
replicated on all processors. The communication shown in Figure 2.1(a) al- 
ready incorporates message vectorization, a commonly used optimization to 
move communication out of loops. While message vectorization is captured 
naturally by our framework as we shall explain, in this chapter we focus on 
other important optimizations that illustrate the power of this framework. 
Our analysis is independent of the actual primitives (such as send-receive,
broadcast) used to implement communication. We use the notation x(i) →
VPROCS(i, j) (the ranges of i and j are omitted to save space in the figure)
to mean that the value of x(i) is sent to the virtual processor position
VPROCS(i, j), for all 1 ≤ i ≤ 100, 1 ≤ j ≤ 100. Reduced communication after
global optimization is shown in Figure 2.1(b). We consider optimizations
performed for each variable d, e, and w.

There are two identical communications for e in Figure 2.1(a), which result
from the uses of e in statements 10 and 26. In both cases, e(i) must
be sent to VPROCS(i, j), for all values of i, j. However, because of the
assignment to e(1) in statement 13, the second communication is only
partially redundant. Thus, we can eliminate the second communication, except
for sending e(1) to VPROCS(1, j), for all values of j. This reduced
communication is hoisted to the earliest possible place after statement 13 in
Figure 2.1(b).

In Figure 2.1(a), there are two communications for d, resulting from uses 
of d in statements 16 and 26. In Figure 2.1(b), the second communication 
has been hoisted to the earliest possible place, after statement 7, where it 
subsumes the first communication, which has been eliminated. 

Finally, there are two communications for w, resulting from uses of w 
in statements 16 and 28. The second, partially redundant, communication 
is hoisted inside the two branches of the if statement, and is eliminated 
in the then branch. The assignment to w(i) at statement 21 prevents the 
communication in the else branch from being moved earlier. 

The result of this collection of optimizations leads to a program in which 
communication is initiated as early as possible, and the total volume of com- 
munication has been reduced. 
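The partial redundancy of the second communication for e can be sketched by treating availability as a set of (element, processor) pairs. This is only an illustrative model in Python: the set representation and the helper name `send` are assumptions, ownership is ignored, and the chapter's actual representation (the ASD) is introduced in the next section.

```python
N = 100


def send(elements, procs_for):
    """Return the set of (element, processor) pairs made available
    by sending each element to the processors given by procs_for."""
    return {(x, p) for x in elements for p in procs_for(x)}


def row(i):
    """All positions VPROCS(i, j), 1 <= j <= N."""
    return [(i, j) for j in range(1, N + 1)]


# Communication for the use of e in statement 10: e(i) -> VPROCS(i, j).
available = send(range(1, N + 1), row)

# Statement 13 assigns e(1), which invalidates every copy of e(1).
available = {(x, p) for (x, p) in available if x != 1}

# Communication required for the use of e in statement 26 (same pattern).
needed = send(range(1, N + 1), row)

# Only the part that is not already available has to be communicated.
residual = needed - available
print(len(needed), len(residual))
```

The residual contains only pairs for e(1), one per position in the first row of VPROCS, which matches the reduced communication placed after statement 13.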



3. Available Section Descriptor 

In this section, we describe the Available Section Descriptor (ASD), a repre- 
sentation of data and its availability on processors. When referring to avail- 
ability of data, we do not explicitly include information about the ownership 
of data, unless stated otherwise. This enables us to keep a close correspon- 
dence between the notions of availability of data and communication: in the 
context of our work, the essential view of communication is that it makes data 
available at processors which need it for their computation; the identity of the 
sender may be changed as long as the receiver gets the correct data. Hence, 
the ASD serves both as a representation of communication (by specifying 
the data to be made available at processors), and of the data actually made 
available by previous communications. Data remains available at receiving 
processors until it is modified by its owner or until the communication buffer 
holding non-local data is deallocated. Our analysis for determining the
availability of data is based on the assumption that a communication buffer is
never deallocated before the last read reference to that data. This can be
ensured by the code generator after the analysis has identified read references
which lead to communication that is redundant. Section 3.1 describes the
ASD representation. Section 3.2 describes how the communication generated
at a statement is computed in terms of an ASD representation.

    !HPF$ align (i,j) with VPROCS(i,j) :: a, z
    !HPF$ align (i) with VPROCS(i,1) :: d, e, w

(a)
     1: do i = 1, 100
     5:   e(i) = d(i) * w(i)
     6:   d(i) = d(i) + 2 * w(i)
     7: end do
            e(i) → VPROCS(i,j)
     8: do i = 1, 100
     9:   do j = 1, 100
    10:     z(i,j) = e(i)
    11:   end do
    12: end do
    13: e(1) = 2 * d(1)
    14: if (s = 0) then
            d(i), w(i) → VPROCS(i,100)
    15:   do i = 1, 100
    16:     z(i,100) = d(i) / w(i)
    17:   end do
    18: else
    19:   do i = 1, 100
    20:     z(i,100) = m
    21:     w(i) = m
    22:   end do
    23: end if
            e(i), d(i) → VPROCS(i,j)
            w(i) → VPROCS(i,100)
    24: do j = 1, 100
    25:   do i = 1, 100
    26:     a(i,j) = a(i,j) + (d(i) * e(i))/z(i,j)
    27:   end do
    28:   z(j,100) = w(j)
    29: end do

(b)
        do i = 1, 100
          e(i) = d(i) * w(i)
          d(i) = d(i) + 2 * w(i)
        end do
            e(i), d(i) → VPROCS(i,j)
        do i = 1, 100
          do j = 1, 100
            z(i,j) = e(i)
          end do
        end do
        e(1) = 2 * d(1)
            e(1) → VPROCS(1,j)
        if (s = 0) then
            w(i) → VPROCS(i,100)
          do i = 1, 100
            z(i,100) = d(i) / w(i)
          end do
        else
          do i = 1, 100
            z(i,100) = m
            w(i) = m
          end do
            w(i) → VPROCS(i,100)
        end if
        do j = 1, 100
          do i = 1, 100
            a(i,j) = a(i,j) + (d(i) * e(i))/z(i,j)
          end do
          z(j,100) = w(j)
        end do

Fig. 2.1. Program before and after communication optimizations.

3.1 Representation of ASD 

The ASD is defined as a pair $\langle D, M \rangle$, where $D$ is a data descriptor, and $M$
is a descriptor of the function mapping elements in $D$ to virtual processors.
Thus, $M(D)$ refers collectively to processors where data in $D$ is available. For
an array variable, the data descriptor represents an array section. For a scalar
variable, it consists of just the name of the variable. Many representations
like the regular section descriptor (RSD) [9], and the data access descriptor
(DAD) [5] have been proposed in the literature to summarize array sections.
Our analysis is largely independent of the actual descriptor used to represent
data.

For the purpose of analysis in this work, we shall use the bounded regular
section descriptor (BRSD) [22], a special version of the RSD, to represent
array sections and treat scalar variables as degenerate cases of arrays with
no dimensions. Bounded regular sections allow representation of subarrays
that can be specified using the Fortran 90 triplet notation. We represent a
bounded regular section as an expression $A(S)$, where $A$ is the name of an
array variable, $S$ is a vector of subscript values such that each of its elements
is either (i) an expression of the form $\alpha \cdot k + \beta$, where $k$ is a loop index
variable and $\alpha$ and $\beta$ are invariants, (ii) a triple $l : u : s$, where $l$, $u$, and $s$
are invariants (the triple represents the expression discussed above expanded
over a range), or (iii) $\bot$, indicating no knowledge of the subscript value.

The processor space is regarded as an unbounded grid of virtual processors.
The abstract processor space is similar to a template in High Performance
Fortran (HPF) [14], which is a grid over which different arrays are
aligned. The mapping function descriptor $M$ is a pair $\langle P, F \rangle$, both $P$ and
$F$ being vectors of length equal to the dimensionality of the processor grid.
The $i$th element of $P$ (denoted as $P^i$) indicates the dimension of the array
$A$ that is mapped to the $i$th grid dimension, and $F^i$ is the mapping function
for that array dimension, i.e., $F^i(j)$ returns the position(s) along the $i$th grid
dimension to which the $j$th element of the array dimension is mapped. We
represent a mapping function, when known statically, as

    $F^i(j) = (c \cdot j + l : c \cdot j + u : s)$

In the above expression, $c$, $l$, $u$ and $s$ are invariants. The parameters $c$, $l$ and $u$
may take rational values, as long as $F^i(j)$ evaluates to a range over integers,
over the data domain. The above formulation allows representation of one-to-one
mappings (when $l = u$), one-to-many mappings (when $u \geq l + s$), and also
constant mappings (when $c = 0$). The one-to-many mappings expressible with
this formulation are more general than the replicated mappings for ownership
that may be specified using HPF [14]. Under an HPF alignment directive,
the $j$th element of array dimension $P^i$ may be mapped along the $i$th grid
dimension to position $c \cdot j + o$ or $*$, which represents all positions in that
grid dimension.
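The three kinds of mapping can be made concrete with a small Python sketch of a mapping function of the form (c·j + l : c·j + u : s). The function name `mapping` and the choice to expand the triplet into an explicit list are assumptions made for illustration; a compiler would keep the triplet symbolic.

```python
from fractions import Fraction


def mapping(c, l, u, s=1):
    """Return F with F(j) = (c*j + l : c*j + u : s), expanded to a list of
    grid positions. c, l and u may be rational, but F(j) must evaluate to
    integer bounds."""
    def F(j):
        lo = Fraction(c) * j + l
        hi = Fraction(c) * j + u
        if lo.denominator != 1 or hi.denominator != 1:
            raise ValueError("F(j) must evaluate to a range over integers")
        return list(range(int(lo), int(hi) + 1, s))
    return F


one_to_one = mapping(c=1, l=0, u=0)        # l == u
one_to_many = mapping(c=1, l=0, u=3)       # u >= l + s
constant = mapping(c=0, l=1, u=100)        # c == 0
halved = mapping(c=Fraction(1, 2), l=0, u=0)  # rational c, valid for even j

print(one_to_one(5), one_to_many(5), len(constant(5)), halved(4))
```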

If an array has fewer dimensions than the processor grid (this also holds
for scalars, which are viewed as arrays with no dimensions), there is no array
dimension mapped to some of the grid dimensions. For each such grid dimension
$m$, $P^m$ takes the value $\mu$, which represents a “missing” array dimension.
In that case, $F^m$ is no longer a function of a subscript position. It is simply an
expression of the form $l : u : s$, and indicates the position(s) in the $m$th grid
dimension at which the array is available. As with the usual triplet notation,
we shall omit the stride, $s$, from an expression when it is equal to one. When
the compiler is unable to infer knowledge about the availability of data, the
corresponding mapping function is set to $\bot$. We also define a special,
universal mapping function descriptor $U$, which represents the mapping of each
data element on all of the processors.

Example. Consider a 2-D virtual processor grid VPROCS, and an ASD
$\langle A(2 : 100 : 2, 1 : 100), \langle [1, 2], [F^1, F^2] \rangle \rangle$, where $F^1(i) = i - 1$ and
$F^2(j) = 1 : 100$. The ASD represents an array section $A(2 : 100 : 2, 1 : 100)$,
each of whose elements $A(2 \cdot i, j)$ is available at a hundred processor positions
given by VPROCS$(2 \cdot i - 1, 1 : 100)$. This ASD is illustrated in Figure 3.1.
Figure 3.1(a) shows the array $A$, where each horizontal stripe $A_i$ represents
$A(2 \cdot i, 1 : 100)$. Figure 3.1(b) represents the mapping of the array section onto
the virtual processor template VPROCS, where each subsection $A_i$ is replicated
along its corresponding row.
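The example can be written down as a small data structure. The Python sketch below is an assumption-laden illustration: the class layout and names are hypothetical, sections and mapping results are expanded into explicit lists, missing dimensions are not handled, and the first mapping function is taken to place array row r at grid row r − 1.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


def triplet(l, u, s=1):
    """Expand a Fortran 90 style triplet l:u:s into a list of integers."""
    return list(range(l, u + 1, s))


@dataclass
class ASD:
    name: str                                # array name
    section: List[List[int]]                 # D: index list per array dimension
    dims: List[int]                          # P: array dimension per grid dimension (1-based)
    maps: List[Callable[[int], List[int]]]   # F: positions along each grid dimension

    def processors(self, element: Tuple[int, ...]):
        """Grid positions at which one array element is available."""
        per_dim = [f(element[d - 1]) for d, f in zip(self.dims, self.maps)]
        return list(product(*per_dim))


asd = ASD(
    name="A",
    section=[triplet(2, 100, 2), triplet(1, 100)],
    dims=[1, 2],
    maps=[lambda i: [i - 1], lambda j: triplet(1, 100)],
)

where = asd.processors((4, 7))   # element A(4, 7)
print(len(where), where[0], where[-1])
```

Every element of row 4 of the section is reported as available along the whole of grid row 3, which is the replication along a row that the example describes.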







[Figure: (a) Array A, (b) Processor Grid]

Fig. 3.1. Illustration of ASD

3.2 Computing Generated Communication

Given an assignment statement of the form lhs = rhs, we describe how
communication needed for each reference in the rhs expression is represented
as an ASD. This section deals only with communication needed for a single
instance of the assignment statement, which may appear nested inside loops.
The procedure for summarizing communication requirements of multiple
instances of a statement with respect to the surrounding loops is discussed in
the next section. We shall describe our procedure for array references with
an arbitrary number of dimensions; references to scalar variables can be viewed
as special cases with zero dimensions. This analysis is applied after any array
language constructs (such as Fortran 90 array assignment statements) have
been scalarized into equivalent Fortran 77-like assignment statements appearing
inside loops. Each subscript expression is assumed to be a constant or an
affine function of a surrounding loop index. If any subscript expressions are
non-linear or coupled, then the ASD representing that communication is set
to $\bot$ and is conservatively underestimated or overestimated, based on the
context.

As remarked earlier, the identity of senders is ignored in our representation
of communication. The ASD simply represents the intended availability
of data to be realized via the given communication, or equivalently, the
availability of data following that communication. Clearly, that depends on the
mapping of computation to processors. In this work, we determine the
generation of communication based on the owner computes rule, which assigns
the computation to the owner of the lhs. The algorithm can be modified
to incorporate other methods of assigning computation to processors [11], as
long as that decision is made statically.

Let $D_L$ be the data descriptor for the lhs reference and $M_L = \langle P_L, F_L \rangle$
be the mapping function descriptor representing the ownership of the lhs
variable (this case represents an exception where the mapping function of
ASD corresponds to ownership of data rather than its availability at other
processors). $M_L$ is directly obtained from the HPF alignment information
which specifies both the mapping relationship between array dimensions and
grid dimensions (giving $P_L$) and the mapping of array elements to grid
positions (giving $F_L$), as described earlier. We calculate the mapping of the rhs
variable $\langle D_R, M_R \rangle$ that results from enforcing the owner computes rule. The
new ASD, denoted CGEN, represents the rhs data aligned with lhs. The
regular section descriptor $D_R$ represents the element referenced by rhs. The
mapping descriptor $M_R = \langle P_R, F_R \rangle$ is obtained by the following procedure:

Step 1. Align array dimensions with processor grid dimensions:

1. For each processor grid dimension $i$, if the lhs subscript expression,
$S_L^{P_L^i}$, in dimension $P_L^i$ has the form $\alpha_1 \cdot k + \beta_1$ and there is a rhs
subscript expression $S_R^n = \alpha_2 \cdot k + \beta_2$, for the same loop index variable $k$,
set $P_R^i$ to the rhs subscript position $n$.
2. For each remaining processor grid dimension $i$, set $P_R^i$ to $j$, where $j$ is an
unassigned rhs subscript position. If there is no unassigned rhs subscript
position left, set $P_R^i$ to $\mu$.

Step 2. Calculate the mapping function for each grid dimension:

For each processor grid dimension $i$, let $F_L^i(j) = c \cdot j + o$ be the ownership
mapping function of the lhs variable ($c$ and $o$ are integer constants, with the
exception of replicated mapping, where $c = 0$ and $o$ represents the range of
all positions in that grid dimension). We determine the rhs mapping function
$F_R^i(j)$ from the lhs and the rhs subscript expressions corresponding
respectively to dimensions $P_L^i$ and $P_R^i$. The details are specified in Table 3.1.
The first entry in Table 3.1 follows from the fact that element $j = \alpha_2 \cdot k + \beta_2$
of rhs dimension $P_R^i$ is made available at grid position $c \cdot (\alpha_1 \cdot k + \beta_1) + o$
along the $i$th dimension; substituting $k$ by $(j - \beta_2)/\alpha_2$ directly leads to the
given result. The second and the third entries correspond to the special cases
when the rhs dimension has a constant subscript or there is no rhs dimension
mapped to grid dimension $i$. The last entry represents the case when there is no
lhs array dimension mapped to grid dimension $i$. In that case, the mapping
function of the lhs variable must have $c = 0$.







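lhs subscript                  rhs subscript                                     $F_R^i(j)$
$\alpha_1 \cdot k + \beta_1$   $\alpha_2 \cdot k + \beta_2$, $\alpha_2 \neq 0$   $c \cdot (\alpha_1 \cdot (j - \beta_2)/\alpha_2 + \beta_1) + o$
$\alpha_1 \cdot k + \beta_1$   $\beta_2$                                         $c \cdot (\alpha_1 \cdot k + \beta_1) + o$
$\alpha_1 \cdot k + \beta_1$   “missing”                                         $c \cdot (\alpha_1 \cdot k + \beta_1) + o$
“missing”                      $\alpha_2 \cdot k + \beta_2$                      $o$ ($c$ must be 0)

Table 3.1. Mapping function calculation based on the owner computes rule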



Example. Consider the assignment statement in the code fragment:

    !HPF$ ALIGN A(i,j) WITH VPROCS(i,j)
    ...
    A(i,j) = ... B(2*i, j-1) ...

The ownership mapping descriptor for the lhs variable A is $\langle [1, 2], F_L \rangle$,
where $F_L^1(i) = i$ and $F_L^2(j) = j$. This mapping descriptor is derived from the
HPF alignment specification. Applying Step 1 of the compute rule algorithm,
$P_R$ is set to $[1, 2]$, that is, the first dimension of VPROCS is aligned with the
first dimension of B, and the second dimension of VPROCS is aligned with
the second dimension of B.

The second step is to determine the mapping function $F_R$. For the first grid
dimension, $P_L^1$ corresponds to the subscript expression $i$ and $P_R^1$ corresponds
to the subscript expression $2 \cdot i$. Therefore, using $F_L^1$ and the first rule in
Table 3.1, $F_R^1(i)$ is set to $1 \cdot (1 \cdot (i - 0)/2 + 0) + 0 = i/2$. For the second grid
dimension, $P_L^2$ corresponds to the subscript expression $j$, and $P_R^2$ corresponds
to the subscript expression $j - 1$. Using $F_L^2$ and the first rule in Table 3.1, $F_R^2(j)$
is set to $j + 1$. The mapping descriptor thus obtained maps B(2*i, j-1) onto
VPROCS(i, j).
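The two results in this example can be checked mechanically. The Python sketch below covers only the case where both subscripts are affine in the same loop index; the function name `rhs_mapping` and the (slope, offset) return convention are assumptions, and the formula is the one obtained from the substitution of k by (j − β2)/α2 described in Step 2.

```python
from fractions import Fraction


def rhs_mapping(c, o, a1, b1, a2, b2):
    """lhs subscript a1*k + b1, rhs subscript a2*k + b2 (a2 != 0),
    ownership mapping F_L(j) = c*j + o.
    Returns (slope, offset) such that F_R(j) = slope*j + offset."""
    if a2 == 0:
        raise ValueError("this rule applies only when a2 != 0")
    # F_R(j) = c * (a1 * (j - b2) / a2 + b1) + o
    slope = Fraction(c * a1, a2)
    offset = c * (Fraction(-a1 * b2, a2) + b1) + o
    return slope, offset


# A(i, j) = ... B(2*i, j-1) ..., with A(i, j) aligned to VPROCS(i, j).
first = rhs_mapping(c=1, o=0, a1=1, b1=0, a2=2, b2=0)    # subscripts i and 2*i
second = rhs_mapping(c=1, o=0, a1=1, b1=0, a2=1, b2=-1)  # subscripts j and j-1
print(first, second)
```

The first call yields slope 1/2 and offset 0 (that is, i/2), and the second yields slope 1 and offset 1 (that is, j + 1), in agreement with the example.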

4. Data Flow Analysis 

In this section, we present a procedure for obtaining data flow information re- 
garding communication for a structured program. The analysis is performed 
on the control flow graph representation [1] of the program, in which nodes 
represent computation, and edges represent the flow of control. We are able 
to perform a collection of communication optimizations within a single frame- 
work, based on the following observations. Determining the data availability 
resulting from communication is a problem similar to determining available 
expressions in classical data flow analysis. Thus, optimizations like eliminat- 
ing and hoisting communications are similar to eliminating redundant expres- 
sions and code motion. Furthermore, applying partial redundancy elimination 
techniques at the granularity of sections of arrays and processors enables not 
merely elimination, but also reduction in the volume of communication along 
different control flow paths. 
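The analogy with available expressions can be illustrated with the classical forward analysis over plain sets. This Python sketch is not the ASD-based formulation developed in this section: the set representation, the function name `available`, and the four-node graph (modelled loosely on the if statement of Figure 2.1, with "w" standing for the communication of w) are assumptions for illustration.

```python
def available(nodes, preds, gen, kill, universe):
    """Classical forward 'available' analysis over sets.
    nodes[0] is the entry node; an item is available at entry to a node
    only if it is available at the exit of every predecessor."""
    entry = nodes[0]
    avin = {n: set() for n in nodes}
    avout = {n: set(universe) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n != entry:
                outs = [avout[p] for p in preds[n]]
                avin[n] = set.intersection(*outs) if outs else set()
            out = (avin[n] - kill[n]) | gen[n]
            if out != avout[n]:
                avout[n] = out
                changed = True
    return avin, avout


nodes = ["entry", "then", "else", "join"]
preds = {"entry": [], "then": ["entry"], "else": ["entry"], "join": ["then", "else"]}
gen = {"entry": set(), "then": {"w"}, "else": set(), "join": set()}
kill = {"entry": set(), "then": set(), "else": {"w"}, "join": set()}

avin, avout = available(nodes, preds, gen, kill, universe={"w"})
print(avout["then"], avout["else"], avin["join"])
```

Because "w" is available on exit from only one branch, it is not available at the join, so a later communication of w is partially rather than fully redundant.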

The bidirectional data-flow analysis for suppression of partial redun- 
dancies, introduced by Morel and Renvoise [31], and refined subsequently 
[12,13,25], defines a framework for unifying common optimizations on avail- 
able expressions. We adapt this framework to solve the set of communication 
optimizations described in Section 2. This section presents the following re- 
sults. 

— Section 4.1 reformulates the refined data-flow equations from [13] in terms 
of ASDs. We have incorporated a further modification that is useful in the 
context of optimizing communication. 

— Section 4.2 shows that the bidirectional problem of determining the possible 
placement of communication can be solved by obtaining a solution to a 
backward problem, followed by a forward correction. 

— In contrast to previous work, solving these equations for ASDs requires 
array data-flow analysis. In Section 4.3, we present the overall data-flow 
procedure that uses interval analysis. 

As with other similar frameworks, we require the following edge-splitting 
transformation to be performed on the control flow graph before the analysis 
begins: any edge that runs directly from a node with more than one successor, 
to a node with more than one predecessor, is split [13]. This transformation 
is illustrated in Figure 4.1. Thus, in the transformed graph, there is no direct 
edge from a branch node to a join node. 

Fig. 4.1. Edge splitting transformation 
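The edge-splitting transformation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the control flow graph is assumed to be a dictionary from node name to successor list, and the naming scheme for the inserted nodes is hypothetical.

```python
def split_critical_edges(succ):
    """Split every edge from a multi-successor node to a multi-predecessor node.

    succ maps each node to the list of its successors; every node must
    appear as a key, with an empty list if it has no successors.
    """
    pred_count = {}
    for targets in succ.values():
        for target in targets:
            pred_count[target] = pred_count.get(target, 0) + 1

    result = {node: list(targets) for node, targets in succ.items()}
    for node, targets in succ.items():
        if len(targets) <= 1:
            continue  # not a branch node
        for index, target in enumerate(targets):
            if pred_count.get(target, 0) > 1:  # target is a join node
                new_node = f"split_{node}_{target}"
                result[node][index] = new_node
                result[new_node] = [target]
    return result


cfg = {"a": ["b", "c"], "b": ["c"], "c": []}
split_cfg = split_critical_edges(cfg)
```

In this small graph only the edge from `a` to `c` runs from a branch node to a join node, so it is the only edge that receives an intermediate node.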



4.1 Data Flow Variables and Equations 

We use the following definitions for data-flow variables representing informa- 
tion about communication at different nodes in the control flow graph. Each 
of these variables is represented as an ASD. 

ANTLOCi : communication in node i, that is not preceded by a definition in 
node i of any data being communicated (i.e., local communication that may 
be anticipated at entry to node i). 

CGENi : communication in node i, that is not followed by a definition in 
node i of any data being communicated. 

KILLi : data being killed (on all processors) due to a definition in node i. 

AVINi/AVOUTi : availability of data at the entry/exit of node i. 

PPINi/PPOUTi : variables representing safe placement of communication 
at the entry/exit of node i, with some additional properties (described later 
in this section). 

INSERTi : communication that should be inserted at the exit of node i. 

REDUNDi : communication in node i that is redundant. 

Local Data Flow Variables. For an assignment statement, both ANTLOC 
and CGEN are set to the communication required to send each variable ref- 
erenced on the right hand side (rhs) to the processor executing the statement. 
That communication depends on the compute-rule used by the compiler in 
translating the source program into SPMD form. Consider the program segment 
from the example in Section 3.2: 

    HPF ALIGN A(i,j) WITH VPROCS(i,j) 
    ... 
    A(i,j) = ... B(2*i, j-1) ... 

Using the procedure in Section 3.2, we compute the communication 
necessary to send B(2*i, j-1) to the processor executing the statement as: 
CGEN = ANTLOC = ⟨B(2*i, j-1), ⟨[1,2], [F1, F2]⟩⟩, where F1(i) = i/2 
and F2(j) = j + 1. The KILL variable for the statement is set to the ASD 
that pairs A(i,j) with the set of all processors, 
signifying that A(i,j) is killed on all processors. The procedure for deter- 
mining CGEN, ANTLOC, and KILL for nodes corresponding to program 
intervals shall be discussed in Section 4.3. 

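Global Data Flow Variables. The data flow equations, as adapted from 
[13], are shown below.^1 AVIN is defined as the empty set ∅ for the entry 
node, while PPOUT is defined as ∅ for the exit node and initialized to the 
universal set for all other nodes. 

\begin{align}
\mathit{AVOUT}_i &= [\mathit{AVIN}_i - \mathit{KILL}_i] \cup \mathit{CGEN}_i \tag{4.1}\\
\mathit{AVIN}_i &= \bigcap_{p \in pred(i)} \mathit{AVOUT}_p \tag{4.2}\\
\mathit{PPIN}_i &= [(\mathit{PPOUT}_i - \mathit{KILL}_i) \cup \mathit{ANTLOC}_i] \cap \Big[\bigcap_{p \in pred(i)} (\mathit{AVOUT}_p \cup \mathit{PPOUT}_p)\Big] \tag{4.3}\\
\mathit{PPOUT}_i &= \bigcap_{s \in succ(i)} \mathit{PPIN}_s \tag{4.4}\\
\mathit{INSERT}_i &= [\mathit{PPOUT}_i - \mathit{AVOUT}_i] - [\mathit{PPIN}_i - \mathit{KILL}_i] \tag{4.5}\\
\mathit{REDUND}_i &= \mathit{PPIN}_i \cap \mathit{ANTLOC}_i \tag{4.6}
\end{align}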



The problem of determining the availability of data (AVINi/AVOUTi) is 
similar to the classical data-flow problem of determining available expressions 
[1]. This computation proceeds in the forward direction through the control 
flow graph. The first equation ensures that any data overwritten inside node 
i is removed from the availability set, and data communicated during node 
i (and not overwritten later) is added to the availability set. The second 
equation indicates that at entry to a join node in the control flow graph, only 
the data available at exit on each of the predecessor nodes can be considered 
to be available. 
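The forward availability computation of Equations 4.1 and 4.2 can be sketched as follows. This is a simplified illustration under stated assumptions: plain Python sets stand in for ASDs (so every set operation is exact), the graph is acyclic, and the nodes are supplied in topological order. The node names and data items are hypothetical.

```python
def availability(order, pred, kill, cgen):
    """Forward pass over an acyclic CFG given in topological order."""
    avin, avout = {}, {}
    for node in order:
        incoming = [avout[p] for p in pred[node]]
        # Equation 4.2: only data available on every incoming path counts.
        avin[node] = set.intersection(*incoming) if incoming else set()
        # Equation 4.1: drop overwritten data, add data communicated here.
        avout[node] = (avin[node] - kill[node]) | cgen[node]
    return avin, avout


order = ["entry", "then", "else", "join"]
pred = {"entry": [], "then": ["entry"], "else": ["entry"], "join": ["then", "else"]}
kill = {"entry": set(), "then": set(), "else": set(), "join": {"d"}}
cgen = {"entry": set(), "then": {"d"}, "else": {"d", "e"}, "join": set()}
avin, avout = availability(order, pred, kill, cgen)
```

At the join node only `d` is available on entry, because `e` is communicated on just one of the two branches; the definition of `d` inside the join node then removes it again at exit.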

We now consider the computation of PPIN/PPOUT. The term 
[(PPOUTi − KILLi) ∪ ANTLOCi] in Equation 4.3 denotes the part of 
communication occurring in node i or hoisted into it that can legally be 
moved to the entry of node i. A further intersection of that term with 
[∩_{p ∈ pred(i)} (AVOUTp ∪ PPOUTp)] gives an additional property to PPINi, 
namely that all data included in PPINi must be available at entry to node i 
on every incoming path due to original or moved communication. PPOUTi 
is set to communication that can be placed at entry to each of the successor 
nodes to i, as shown by Equation 4.4. Thus, PPOUTi represents communi- 
cation that can legally and safely appear at the exit of node i. The property 
of safety implies that the communication is necessary, regardless of the flow 
of control in the program. Hence, the compiler avoids doing any speculative 
communication in the process of moving communication earlier. 

^1 The original equation in [13] for PPINi has an additional term, correspond- 
ing to the right hand side being further intersected with PAVINi, the partial 
availability of data at entry to node i. This term is important in the context of 
eliminating partially redundant computation, because it prevents unnecessary 
code motion that increases register pressure. However, moving communication 
early can be useful even if it does not lead to a reduction in previous communi- 
cation, because it may help hide the latency. Hence, we drop that term in our 
equation for PPINi. 

As Equations 4.3 and 4.4 show, the value of PPINi for a node i is not 
only used to compute PPOUTp for its predecessor node p, but it also depends 
on the value of PPOUTp. Hence, this computation represents a bidirectional 
data flow problem. 

Finally, INSERTi represents communication that should be inserted at 
the exit of node i as a result of the optimization. Given that PPOUTi repre- 
sents safe communication at that point, as shown in Equation 4.5, INSERTi 
consists of PPOUTi minus the following two components: (i) data already 
available at exit of node i due to original communication, given by AVOUTi, 
and (ii) data available at entry to node i due to moved or original communi- 
cation, and which has not been overwritten inside node i; this component is 
given by (PPINi − KILLi). Following the insertions, any communication in 
node i that is not preceded by a definition of data (i.e., ANTLOCi) and which 
also forms part of PPINi becomes redundant. This directly follows from the 
property of PPINi that any data included in PPINi must be available at 
entry to node i on every incoming path due to original or moved communica- 
tion. Thus, in Equation 4.6, REDUNDi represents communication in node 
i that can be deleted. 

The union, intersection, and difference operations on ASDs are described 
later in the chapter, in Section 7. The ASDs are not closed under these oper- 
ations (the intersection operation is always exact, except in the special case 
when two mapping functions, of the form Fi(j) = c*j + l : c*j + u : s, 
for corresponding array dimensions have different values of the coefficient 
c). Therefore, it is important to know for each operation whether to un- 
derestimate or overestimate the result, in case an approximation is needed. 
In the above equations, each of AVINi, AVOUTi, PPINi, PPOUTi, and 
REDUNDi is underestimated, if necessary. On the other hand, INSERTi is 
overestimated, if needed. This ensures that the compiler does not incorrectly 
eliminate communication that is actually not redundant. While the over- 
estimation of INSERTi or underestimation of REDUNDi can potentially 
lead to more communication than necessary, our framework has some built-in 
guards against insertion of extra communication relative to the unoptimized 
program. The Morel-Renvoise framework [31] and its modified versions en- 
sure that PPINi and PPOUTi represent safe placements of computation 
at the entry/exit of node i. Correspondingly, in the context of our work, 
PPINi/PPOUTi does not represent more communication than necessary. 

4.2 Decomposition of Bidirectional Problem 

Before we describe our data-flow procedure using the above equations, we 
need to resolve the problem of bidirectionality in the computation of PPINi 
and PPOUTi. Solving a bidirectional problem usually requires an algorithm 
that goes back and forth until convergence is reached. A preferable approach 
is to decompose the bidirectional problem, if possible, into simpler unidirec- 
tional problem(s) which can be solved more efficiently. 

Dhamdhere et al. [13] prove some properties about the bidirectional prob- 
lem of eliminating redundant computation, and also prove that those prop- 
erties are sufficient to allow the decomposition of that problem into two uni- 
directional problems. One of those properties, distributivity, does not hold 
in our case, because we represent data-flow variables as ASDs rather than 
bit strings, and the operations like union and difference are not exact, unlike 
the boolean operations. However, we are able to prove directly the following 
theorem: 

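Theorem 4.1. The bidirectional problem of determining PPINi and 
PPOUTi, as given by Equations 4.3 and 4.4, can be decomposed into a back- 
ward approximation, given by Equations 4.7 and 4.8, followed by a forward 
correction, given by Equation 4.9. 

\begin{align}
\mathit{BA\_PPIN}_i &= (\mathit{PPOUT}_i - \mathit{KILL}_i) \cup \mathit{ANTLOC}_i \tag{4.7}\\
\mathit{PPOUT}_i &= \bigcap_{s \in succ(i)} \mathit{BA\_PPIN}_s \tag{4.8}\\
\mathit{PPIN}_i &= \mathit{BA\_PPIN}_i \cap \Big[\bigcap_{p \in pred(i)} (\mathit{AVOUT}_p \cup \mathit{PPOUT}_p)\Big] \tag{4.9}
\end{align}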

Proof: BA_PPINi represents a backward approximation to the value of 
PPINi (intuitively, it represents communication that can legally and safely 
be moved to the entry of node i). We will show that the correction term 
(∩_{p ∈ pred(i)} (AVOUTp ∪ PPOUTp)) applied to a node i to obtain PPINi can- 
not lead to a change in the value of PPOUT for any node in the control flow 
graph, and that in turn implies that the PPIN values of other nodes are also 
unaffected by this change. 

The correction term, being an intersection operation, can only lead to a 
reduction in the value of the set PPINi. Let X = BA_PPINi − PPINi 
denote this reduction, and let x denote an arbitrary element of X. Thus, 
x ∈ BA_PPINi, and x ∉ PPINi. Hence, there must exist a predecessor of 
i, say, node p (see Figure 4.2), such that x ∉ AVOUTp and x ∉ PPOUTp. 
Therefore, p must have another child j such that x ∉ BA_PPINj, otherwise 
x would have been included in PPOUTp. Now let us consider the possible 
effects of removal of x from PPINi. From the given equations, a change in 
the value of PPINi can only affect the value of PPOUT for a predecessor of 
i (which can possibly lead to other changes). Clearly, the value of PPOUTp 
does not change because PPOUTp already does not include x. But node i 
cannot have any predecessors other than p because p is a branch node, and 
by virtue of the edge splitting transformation on the control flow graph, there 
can be no edge from a branch node to a join node. Hence, the application 
of the correction term at a node i cannot change the PPOUT value of any 
node: this implies the validity of the above process of decomposing the bidi- 
rectional problem. 

Fig. 4.2. Proving the decomposition of the bidirectional problem 

We observe that since the application of the correction term to a node 
does not change the value of PPOUT or PPIN of any other node, it does not 
require a separate pass through the control flow graph. During the backward 
pass itself, after the value of PPOUTp is computed for a node p, the correction 
term can be applied to its successor node i by intersecting BA_PPINi with 
AVOUTp ∪ PPOUTp. 
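The decomposed procedure can be illustrated end to end on a small acyclic graph. The Python sketch below is a simplification under several assumptions: plain sets replace ASDs, the graph has already been edge-split and is given in topological order, and a node without predecessors is given an empty PPIN, following the classical formulation (the text above does not spell out that boundary case). For readability the forward correction is applied in a separate loop rather than inside the backward pass.

```python
def place_communication(order, succ, kill, cgen, antloc):
    pred = {n: [] for n in order}
    for n in order:
        for s in succ[n]:
            pred[s].append(n)

    # Forward pass: Equations 4.1 and 4.2.
    avin, avout = {}, {}
    for n in order:
        incoming = [avout[p] for p in pred[n]]
        avin[n] = set.intersection(*incoming) if incoming else set()
        avout[n] = (avin[n] - kill[n]) | cgen[n]

    # Backward approximation: Equations 4.7 and 4.8.
    ppout, ba_ppin = {}, {}
    for n in reversed(order):
        outgoing = [ba_ppin[s] for s in succ[n]]
        ppout[n] = set.intersection(*outgoing) if outgoing else set()
        ba_ppin[n] = (ppout[n] - kill[n]) | antloc[n]

    # Forward correction (4.9), then INSERT (4.5) and REDUND (4.6).
    insert, redund = {}, {}
    for n in order:
        if pred[n]:
            correction = set.intersection(*[avout[p] | ppout[p] for p in pred[n]])
            ppin = ba_ppin[n] & correction
        else:
            ppin = set()  # assumption: nothing can be placed before the entry
        insert[n] = (ppout[n] - avout[n]) - (ppin - kill[n])
        redund[n] = ppin & antloc[n]
    return insert, redund


order = ["entry", "a", "b", "c", "d"]
succ = {"entry": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
none = {n: set() for n in order}
kill = {**none, "a": {"x"}}               # node a defines x
cgen = {**none, "b": {"x"}, "d": {"x"}}   # x is communicated in b and in d
antloc = dict(cgen)
insert, redund = place_communication(order, succ, kill, cgen, antloc)
```

In this example the communication of `x` is hoisted to the exit of node `a`, immediately after its definition, and the original communications in `b` and `d` are reported as redundant.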

4.3 Overall Data-Flow Procedure 

So far we have discussed the data flow equations that are applied in a forward 
or backward direction along the edges of a control flow graph to determine 
the data flow information for each node. In the presence of loops, which 
lead to cycles in the control flow graph, one approach employed by classical 
data flow analysis is to iteratively apply the data flow equations over the 
nodes until the data flow solution converges [1] . We use the other well-known 
approach, interval-analysis [2,17], which makes a definite number of traversals 
through each node and is well-suited to analysis such as ours which attempts 
to summarize data flow information for arrays. 

We use Tarjan intervals [38], which correspond to loops in a structured 
program. Each interval in a structured program has a unique header node h. 
As a further restriction, we require each interval to have a single loop exit node 
l. Each interval has a back-edge ⟨l, h⟩. The edge-splitting transformation, 
discussed earlier, adds a node b to split the back-edge ⟨l, h⟩ into two edges, 
⟨l, b⟩ and ⟨b, h⟩. We now describe how interval analysis is used in the overall 
data flow procedure. 

INTERVAL ANALYSIS Interval analysis is precisely defined in [7]. The 
analysis is performed in two phases, an elimination phase followed by a prop- 
agation phase. The elimination phase processes the program intervals in a 
bottom-up (innermost to outermost) traversal. During each step of the elim- 
ination phase, data flow information is summarized for inner intervals, and 
each such interval is logically collapsed and replaced by a summary node. 
Thus, when an outer interval is traversed, the inner interval is represented by 
a single node. At the end of the elimination phase, there are no more cycles 
left in the graph. For the purpose of our description, the top-level program 
is regarded as a special interval with no back-edge, which is the first to be 
processed during the propagation phase. Each step of the propagation phase 
expands the summary nodes representing collapsed intervals, and computes 
the data flow information for nodes comprising those intervals, propagating 
information from outside to those nodes. 

Our overall data flow procedure is sketched in Figure 4.3. We now provide 
details of the analysis. 



for each interval in elimination phase (bottom-up) order do 

    1. Compute CGEN and KILL summary in forward traversal of the 
       interval. 

    2. Compute ANTLOC summary in backward traversal of the interval. 

for each interval in propagation phase (top-down) order do 

    1. Compute AVIN and AVOUT for each node in a forward traversal of 
       the interval. 

    2. Compute PPOUT and BA_PPIN for each node in a backward traversal 
       of the interval. Once PPOUT is computed for a node in this traversal, 
       apply the forward correction to BA_PPIN of each of its successor nodes. 
       Once PPIN is obtained for a node via the forward correction(s), deter- 
       mine INSERT and REDUND for that node as well. 

Fig. 4.3. Overall data flow procedure 



4.3.1 Elimination Phase. We now describe how the values of local data- 
flow variables, CGEN, KILL, and ANTLOC are summarized for nodes cor- 
responding to program intervals in each step of the elimination phase. These 
values are used in the computations of global data-flow variables outside that 
interval. 

The computation of KILL and CGEN proceeds in the forward direction, 
i.e., the nodes within each interval are traversed in topological-sort order. For 
the computation of KILL, we define the variable Ki as the data that may be 
killed along any path from the header node h to node i. We initialize the data 
availability information and the kill information (Kh) at the header node as 
follows: 

\begin{align*}
\mathit{AVIN}_h &= \emptyset\\
K_h &= \mathit{KILL}_h
\end{align*}

The transfer function for Ki at all other nodes is defined as follows: 

\begin{align}
K_i = \Big(\bigcup_{p \in pred(i)} K_p\Big) \cup \mathit{KILL}_i \tag{4.10}
\end{align}




Fig. 4.4. Computing summary information for an interval 

The transfer functions given by Equations 4.1, 4.2 and 4.10 are then ap- 
plied to each statement node during the forward traversal of the interval, as 
shown in Figure 4.4. Finally, the data availability generated for the interval 
last node l must be summarized for the entire interval, and associated with 
a summary node s. However, the data availability at l, obtained from Equa- 
tions 4.1 and 4.2, is only for a single iteration of the loop. Following [17], we 
would like to represent the availability of data corresponding to all iterations 
of the loop. 

Definition. For an ASD set S, and a loop with index k varying from low to 
high, expand(S, k, low : high) is a function which replaces all single data 
item references α*k + β used in any array section descriptor D in S by the 
triple (α*low + β : α*high + β : α), and any mapping function of the form 
Fi(j) = c*k + o by Fi(j) = c*low + o : c*high + o : c. 

The following equations define the transfer functions which summarize 
the data being killed and the data being made available in an interval with 
loop index k, for all iterations low : high. 

\begin{align*}
\mathit{KILL}_s &= expand(K_l, k, low : high)\\
\mathit{CGEN}_s &= expand(\mathit{AVOUT}_l, k, low : high) - \Big(\bigcup_{\mathit{AntiDepDef}} expand(\mathit{AntiDepDef}, k, low : high)\Big)
\end{align*}

where AntiDepDef represents each definition in the interval loop that is 
the target of an anti-dependence at the loop nesting level (we conclude that 
a given dependence exists at a loop nesting level m if the direction vector 
corresponding to direction ‘=’ for all outer loops and direction ‘<’ or ‘=’ for 
the loop at level m is included in the direction vector representation [42] of 
that dependence). If the loop is a doall loop, the range of AntiDepDef is 
empty, so that we get CGENs = expand(AVOUTl, k, low : high). 

Explanation. For computing the interval kill set KILLs, we simply expand 
the kill set generated at l over the interval loop bounds low : high. Computing 
the interval availability set CGENs requires more work, because a variable 
definition in a particular iteration may kill data made available in previous 
iterations of the loop. Therefore, we first expand data made available in a 
single iteration, obtaining all data made available in any iteration, and then 
subtract out data that may be killed after it is made available. A definition 
kills data made available in a previous or the same iteration of a loop if it is 
the target of an anti-dependence at the loop nesting level, that is, if it defines 
data previously used. For example, consider the loop shown in Figure 4.5. 
There is a loop-carried anti-dependence from the reference A(i) to A(i-1). 
The data made available during the i-loop due to the reference to A(i) is 
obtained as: 

\begin{align*}
\mathit{CGEN}_s &= expand(\langle A(i), M\rangle, i, 2 : 99) - expand(\langle A(i-1), U\rangle, i, 2 : 99)\\
&= \langle A(2 : 99), M\rangle - \langle A(1 : 98), U\rangle\\
&= \langle A(99), M\rangle
\end{align*}

where M = ⟨[0], [F1]⟩ with F1(j) = j, and U denotes the mapping onto all 
processors. 

    HPF ALIGN WITH VPROCS(i) 

    do i = 2, 99 
      A(i-1) = ... 
      B(i-1) = A(i) 
      A(i+1) = ... 
    enddo 

Fig. 4.5. Example to illustrate interval analysis 

The computation of ANTLOC proceeds in the backward direction, i.e., 
the nodes within each interval are traversed in reverse topological-sort or- 
der. The computation of ANTLOC uses the BA_PPINi and PPOUTi data 
flow sets that respectively represent communication that can be safely antic- 
ipated at (moved to) the entry and the exit of node i. The anticipatability of 
communication at exit from the last node, PPOUTl, is initialized to ∅. The 
transfer functions given by Equations 4.7 and 4.8 are then applied to each 
statement during the backward traversal of the interval. Finally, BA_PPINh 
represents the anticipatability of communication for the interval body for a 
single loop iteration. The communication for the entire interval that precedes 
the data definition is summarized by “expanding” on BA_PPINh, and sub- 
tracting the ASD corresponding to the expanded form of all definitions that 
are sources of a flow dependence at the loop nesting level (since those defini- 
tions correspond to data computed in previous iterations of the loop before 
being communicated): 

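\begin{align*}
\mathit{ANTLOC}_s = expand(\mathit{BA\_PPIN}_h, k, low : high) - \Big(\bigcup_{\mathit{FlowDepDef}} expand(\mathit{FlowDepDef}, k, low : high)\Big)
\end{align*}

For the example shown in Figure 4.5, there is a flow-dependence from A(i+1) 
to the reference A(i). For the communication generated by the reference to 
A(i), we obtain: 

\begin{align*}
\mathit{ANTLOC}_s &= expand(\langle A(i), M\rangle, i, 2 : 99) - expand(\langle A(i+1), U\rangle, i, 2 : 99)\\
&= \langle A(2 : 99), M\rangle - \langle A(3 : 100), U\rangle\\
&= \langle A(2), M\rangle
\end{align*}

where M = ⟨[0], [F1]⟩ with F1(j) = j, and U denotes the mapping onto all 
processors. 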
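The two interval summaries for the loop of Figure 4.5 can be reproduced with a small sketch. It makes a deliberate simplification: only the data part of each ASD is modeled, as a Python set of array indices, and the processor mapping is ignored. The function name `expand` mirrors the definition above, but its signature (coefficients alpha and beta of a subscript alpha*k + beta) is a hypothetical choice made for this illustration.

```python
def expand(alpha, beta, low, high):
    """Indices alpha*k + beta for k = low..high, i.e. the section
    alpha*low + beta : alpha*high + beta : alpha (alpha must be >= 1)."""
    return set(range(alpha * low + beta, alpha * high + beta + 1, alpha))


low, high = 2, 99
use_of_a = expand(1, 0, low, high)       # A(i)   -> A(2:99)
anti_dep_def = expand(1, -1, low, high)  # A(i-1) -> A(1:98)
flow_dep_def = expand(1, 1, low, high)   # A(i+1) -> A(3:100)

cgen_s = use_of_a - anti_dep_def    # data still available after the loop
antloc_s = use_of_a - flow_dep_def  # data that can be fetched before the loop
```

Only A(99) survives in `cgen_s`, since every other element read in the loop is overwritten in a later iteration, and only A(2) survives in `antloc_s`, since every other element read is produced by an earlier iteration.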

4.3.2 Propagation Phase. The propagation phase processes the program 
intervals in a top-down (outermost to innermost) traversal. During the ex- 
pansion of an interval, data flow information from outside is propagated to 
nodes inside that interval. During this traversal of nodes in an interval, any 
inner interval is treated as a single node represented by its summary node. 

Given an interval representing a loop, our analysis calculates the data flow 
information for each node for a loop iteration k. For the forward traversal 
determining the solutions to AVIN/AVOUT, the value of AVINh at the 
beginning of the kth loop iteration is given by: 

\begin{align*}
\mathit{AVIN}_h^k = {} & [\mathit{AVIN}_h^{in} - expand(K_l, k, low : k-1)] \;\cup\\
& \Big[expand(\mathit{AVOUT}_l, k, low : k-1) - \Big(\bigcup_{\mathit{AntiDepDef}} expand(\mathit{AntiDepDef}, k, low : k-1)\Big)\Big]
\end{align*}

Explanation. AVIN_h^{in} represents data that is available at entry to the 
header before the loop is entered (this is the information that is propagated 
from outside the interval). The data available at the beginning of iteration k 
consists of: 

1. The data made available before the loop is entered, which is not killed in 
iterations low : k-1, and 

2. The data made available on all previous iterations low : k-1, which has 
not been killed before iteration k. 

The two terms unioned together in the above equation for AVIN_h^k correspond 
to these two components. In a similar manner, for the backward traversal 
obtaining the solutions to BA_PPIN/PPOUT, the value of PPOUTl at the 
beginning of the kth loop iteration is obtained as: 

\begin{align*}
\mathit{PPOUT}_l^k = {} & [\mathit{PPOUT}_l^{out} - expand(K_l, k, low : k-1)] \;\cup\\
& \Big[expand(\mathit{BA\_PPIN}_h, k, low : k-1) - \Big(\bigcup_{\mathit{FlowDepDef}} expand(\mathit{FlowDepDef}, k, low : k-1)\Big)\Big]
\end{align*}



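Once again, for the example shown in Figure 4.5, using the same definition 
of M as before and writing U for the mapping onto all processors, we get: 

\begin{align*}
\mathit{AVIN}_h^k &= \langle A(2 : k-1), M\rangle - \langle A(1 : k-2), U\rangle = \langle A(k-1), M\rangle\\
\mathit{PPOUT}_l^k &= \langle A(2 : k-1), M\rangle - \langle A(3 : k), U\rangle = \langle A(2), M\rangle
\end{align*}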



As noted earlier, the first step in the propagation phase of our algorithm is 
different from others, in that it is applied over a special interval representing 
the top-level program rather than a loop. For that step, the values of AVINh 
for the entry node h and of PPOUTl for the exit node l are initialized to ∅. 

For each interval processed in the propagation phase, after the initial de- 
termination of AVINh in the forward traversal, the transfer functions given 
by Equations 4.1 and 4.2 are applied to obtain AVOUT and AVIN values 
for the other nodes. Similarly, during the backward traversal, after deter- 
mining PPOUTl for the last node l, Equations 4.7 and 4.8 are applied to 
obtain BA_PPIN and PPOUT values for the remaining nodes. Further- 
more, during this backward traversal, after computing PPOUTp for node p, 
the forward correction given by Equation 4.9 is applied to obtain PPINi for 
each successor node i, as discussed earlier. Following the determination of 
PPINi (which is complete after the last forward correction has been applied 
from its predecessor(s)), the values of INSERTi and REDUNDi are obtained 
using Equations 4.5 and 4.6 respectively. 







5. Communication Optimizations 

Following the determination of INSERT and REDUND for each node, 
communications corresponding to the values of INSERT are placed at the 
exits of nodes, and the values of REDUND are used to delete redundant 
communication. We now describe how different optimizations are captured 
by the data flow procedure that we have described. Message vectorization 
is accounted for by the computation of ANTLOC for an entire interval, as 
it characterizes the communication that can be moved outside a loop. Since 
message vectorization is a well-understood optimization implemented by most 
distributed memory compilers based on data-dependence [19,23,29,30,34,43], 
we shall focus on other important optimizations that require the generality 
of data-flow analysis. 

Both of the equations for determining INSERT and REDUND in- 
herently capture the elimination of redundant communication. When com- 
munication is moved and inserted at some other place, the available data 
(AVOUT) and the communication corresponding to (PPIN − KILL) are 
subtracted from it, as shown by Equation 4.5. REDUNDi refers not only to 
communication made redundant by the availability of data due to some other 
communication, but also serves to remove the original communication which 
has been moved to a different point (and which appears in the INSERT 
term at its new place). 

During the remainder of our discussion, we shall refer to the original com- 
munication being optimized as COMM = ⟨D1, M1⟩, and to the redundant 
part of the communication being deleted as DELETE = ⟨D2, M2⟩. This 
redundant part could correspond to REDUND or to the communication 
being subtracted in the equation for INSERT. We shall use the notation 
D → VPROCS(M(D)) to represent the communication ⟨D, M⟩. 



5.1 Elimination of Redundant Communication 

If COMM ⊆ DELETE, i.e., if D1 ⊆ D2, and M1(D1) ⊆ M2(D1), the 
communication of D1 to processors M1(D1) is redundant because the data is 
already available at those processors under the mapping function M2. Hence, 
COMM can be eliminated. 

For example, in Figure 2.1 after message vectorization, the communication 
for d just before statement (15) is COMM = d(i) → VPROCS(i, 100), 
i = 1 : 100. From our data flow procedure, PPIN includes 
d(i) → VPROCS(i, 1 : 100), i = 1 : 100, at this point in the program. Hence, 
DELETE corresponds to REDUND = d(i) → VPROCS(i, 100), i = 1 : 100. 
Since COMM = DELETE, this communication can be completely eliminated. 

5.2 Reduction in Volume of Communication 

Even if COMM ⊄ DELETE, a non-empty DELETE can still lead to a 
reduction in the volume of communication. The reduced amount of commu- 
nication, as discussed in Section 7 on ASD operations, is given by: 

\begin{align}
\mathit{COMM}_{new} &= \langle D_1, M_1\rangle - \langle D_2, M_2\rangle \notag\\
&= \langle D_1 - D_2, M_1\rangle \cup \langle D_1, M_1 - M_2\rangle \tag{5.1}
\end{align}

Under different conditions, this reduction could mean that the amount of 
data being communicated is reduced, or the number of processors to which 
data is sent could be reduced. 

Reduction in amount of data. The second term in the union operation eval- 
uates to null and the new communication involves a reduced amount of data 
being sent to the same processors if M1(D1) ⊆ M2(D1), but D1 ⊄ D2. 
Intuitively, this case implies that not all of the data to be communicated is 
already available at the intended receivers, but the set of processors to which 
any element common to D1 and D2 has to be sent under communication 
⟨D1 ∩ D2, M1⟩ is a subset of the set of processors where that element is al- 
ready available under the mapping function M2. Hence, the amount of data 
being communicated can be reduced from D1 to D1 − D2. 

For example, in Figure 2.1(a), the communication of e just before state- 
ment 24 is COMM = e(i) → VPROCS(i, 1 : 100), i = 1 : 100. Our algo- 
rithm moves this communication to the point just after statement 13. The 
computation of INSERT for e subtracts the availability of data at this point, 
which is e(i) → VPROCS(i, 1 : 100), i = 2 : 100, i.e., all but the first element 
of the vector e are already available on the columns of VPROCS. Hence, 
the new communication inserted at this point is COMM_new = e(1) → 
VPROCS(1, 1 : 100), which represents a reduction of data. 

Reduction in number of processors involved. The first term in Equation 5.1 
evaluates to null and the new communication involves the same data being 
sent to fewer processors if D1 ⊆ D2, but M1(D1) ⊄ M2(D1). This case arises 
when the data to be communicated is a subset of the data that is available, 
though it is available at fewer processors than needed. Thus, the data D1 
can be sent to just the extra processors, M1(D1) − M2(D1), where it is not 
available. 

This is illustrated in Figure 5.1: after message vectorization, the com- 
munication for variable d at the point just before statement 5 is 
d(i) → VPROCS(i, 1 : 100), i = 1 : 25. Our data flow procedure moves it be- 
fore statement 1. The data already available (AVOUT) at this point is 
d(i) → VPROCS(i, 1 : 50), i = 1 : 100. On subtracting this component 
from the communication being moved, the inserted communication is deter- 
mined to be COMM_new = d(i) → VPROCS(i, 51 : 100), i = 1 : 25. Thus, 
the data is sent to fewer processors. 

    HPF align (i,j) with VPROCS(i,j) :: a, z 
    HPF align (i) with VPROCS(i,1) :: d, w 

(a) 

    d(i) → VPROCS(i, 1 : 50), i = 1 : 100 
    do i = 1, 100 
      do j = 1, 50 
        a(i,j) = a(i,j) * d(j) 
      end do 
    end do 
    d(i) → VPROCS(i, 1 : 100), i = 1 : 25 
    do i = 1, 25 
      do j = 1, 100 
        z(i,j) = a(i,j-1) * d(i) 
      end do 
    end do 

(b) 

    d(i) → VPROCS(i, 1 : 50), i = 1 : 100 
    d(i) → VPROCS(i, 51 : 100), i = 1 : 25 
    do i = 1, 100 
      do j = 1, 50 
        a(i,j) = a(i,j) * d(j) 
      end do 
    end do 
    do i = 1, 25 
      do j = 1, 100 
        z(i,j) = a(i,j-1) * d(i) 
      end do 
    end do 

(Statement labels 1 to 10 appear in the left margin of the original figure.) 

Fig. 5.1. Example illustrating reduction in number of processors 
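Both kinds of reduction can be checked with a small sketch of Equation 5.1. It is a simplification under stated assumptions: a communication ⟨D, M⟩ is modeled as an explicit set of (element, processor) pairs rather than as an ASD, so the difference is always exact, and the helper names `row`, `pairs`, and `reduced_comm` are hypothetical.

```python
def row(i, first, last):
    """Processors VPROCS(i, first:last) as a set of (row, column) pairs."""
    return {(i, c) for c in range(first, last + 1)}


def pairs(data, mapping):
    """All (element, processor) pairs described by <data, mapping>."""
    return {(x, p) for x in data for p in mapping(x)}


def reduced_comm(d1, m1, d2, m2):
    """<D1, M1> - <D2, M2> using the two-term form of Equation 5.1."""
    less_data = pairs(d1 - d2, m1)                              # <D1 - D2, M1>
    fewer_procs = {(x, p) for x in d1 for p in m1(x) - m2(x)}   # <D1, M1 - M2>
    return less_data | fewer_procs


# Vector e: needed on all columns, already available for i = 2:100.
e_new = reduced_comm(set(range(1, 101)), lambda i: row(i, 1, 100),
                     set(range(2, 101)), lambda i: row(i, 1, 100))

# Vector d of Figure 5.1: needed on columns 1:100 for i = 1:25,
# already available on columns 1:50 for i = 1:100.
d_new = reduced_comm(set(range(1, 26)), lambda i: row(i, 1, 100),
                     set(range(1, 101)), lambda i: row(i, 1, 50))
```

Here `e_new` contains only the pairs for e(1) on VPROCS(1, 1:100), and `d_new` contains only the pairs for d(1:25) on columns 51:100, matching the two worked examples. An empty result would correspond to the complete elimination described in Section 5.1.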



A possible negative side-effect of reducing the volume of communication 
by subtracting the redundant component is that a single communication may 
be broken into a number of smaller communications, which may not be de- 
sirable. However, this side-effect can always be controlled because the result 
of the difference operation in Equation 5.1 can always be overestimated to 
give back ⟨D1, M1⟩, the original communication. While in some cases such as 
the ones illustrated above, the optimization will definitely reduce the cost of 
communication, in general the compiler needs a cost estimator to guide these 
decisions. 



5.3 Movement of Communication for Subsumption and for Hiding 
Latency 

Since our framework moves communications as early as legally possible, sub- 
sumption of communication placed earlier in the original program is nat- 
urally taken care of by the data-flow equations. For example, going back 
to Figure 2.1, INSERT for d before statement (8) is determined to be 
d(i) → VPROCS(i, 1 : 100), i = 1 : 100, i.e., our analysis procedure hoists 
this communication from statement (24) to (8). As explained earlier, since 
REDUND for d at statement (15) is d(i) → VPROCS(i, 100), i = 1 : 100, 
the communication of d at this point is subsumed. 

Our data flow analysis procedure moves communications as early as 
legally possible and avoids introducing unnecessary communication, thus han- 
dling conditional control flow effectively. Traditionally, researchers have pro- 
posed inserting sends ahead of receives to help hide the latency of com- 
munication. In the context of our framework, a better approach would be to 
place blocking sends^ and non-blocking receives at the point of insertion 
of communication, and to insert a wait at the reference to non-local data, 
so that the receive completes before that data is read (for vectorized commu- 
nication, the wait would be placed just before entering the outermost loop 
with respect to which the communication has been vectorized). This leads 
to the initiation of communication at the earliest possible point (under the 
constraint that there is no speculative communication), and waiting for the 
data to arrive only when it is needed. Thus, for communication that can be 
moved significantly further ahead, much of the latency can be hidden if the 
underlying target architecture supports overlap between communication and 
computation. 
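The early-initiation, late-wait pattern described above can be illustrated with a small Python sketch. It is only an analogy, not the compiler's generated code: a worker thread stands in for the non-blocking receive, and the helper names `fetch_nonlocal` and `local_work` and the latency value are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_nonlocal(values, latency=0.05):
    """Hypothetical stand-in for a non-blocking receive: data arrives after `latency` s."""
    time.sleep(latency)
    return list(values)


def local_work(n):
    """Computation that does not reference the non-local data."""
    return sum(i * i for i in range(n))


with ThreadPoolExecutor(max_workers=1) as pool:
    # Point of insertion: initiate the transfer as early as is legal.
    pending = pool.submit(fetch_nonlocal, [1.0, 2.0, 3.0])
    # Independent computation proceeds while the transfer is in flight.
    partial = local_work(10_000)
    # The wait is placed just before the first reference to the non-local data.
    d = pending.result()

total = partial + sum(d)
```

How much latency is actually hidden depends on how much independent work separates the initiation from the wait, and on whether the target machine overlaps communication with computation.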



6. Extensions: Communication Placement 

The placement of communication at the earliest point under our framework 
for partial redundancy elimination can also have some undesirable effects. 
The first set of problems is analogous to the issue of register pressure in the 
original data flow framework [31]. Early placement of communication puts 
additional pressure on the memory system at the receiver, as buffers to hold 
non-local data have to be maintained for longer periods of time. Apart from 
using up more memory, this can also degrade performance by polluting the 
cache (the received data may enter the cache due to an unpack operation 
and contribute to cache pollution if it is not referenced until much later). 
A related problem is the potentially increased contention for the network, 
which can reduce the effective network bandwidth. 

An even more striking problem is that earliest placement of communi- 
cation can lead to opportunities being missed for reducing the number of 
messages. This is illustrated by the simple example shown in Figure 6.1. If 
communication is moved eagerly to the earliest point, there are two separate 
messages needed for the references to B and C. On the other hand, if
communication for B is deferred suitably, it can be combined with
communication for C. Thus, by considering the interactions among different
communications under different placements, we can obtain further reductions
in the number of messages.
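As a rough illustration of this effect, the following Python sketch counts messages under the two placements of Figure 6.1. It assumes, for illustration only, that communications placed at the same statement and sent to the same processors can always be combined into one message; the statement numbers and the `count_messages` helper are hypothetical.

```python
def count_messages(placement):
    """Count messages for a placement.

    `placement` maps an array name to the statement after which its
    communication is placed. Communications sharing a position (and, by
    assumption, the same mapping VP(i+1)) are combined into one message.
    """
    return len(set(placement.values()))


# Statement 1 defines B, statement 2 defines C, statement 3 uses both.
earliest = {"B": 1, "C": 2}   # each communication hoisted right after its definition
deferred = {"B": 2, "C": 2}   # communication for B deferred to where C is available

messages_earliest = count_messages(earliest)
messages_deferred = count_messages(deferred)
```

Here the earliest placement needs two messages, while deferring B yields a single aggregate message.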

We describe briefly an algorithm which considers all communications in a 
procedure and their interactions under different placements, before finally se- 
lecting the placement of communications such that the number of messages is 
minimized through redundancy elimination and combining of messages [10]. 
The algorithm first determines, for each reference that needs communication, 

¹ We refer to a send as blocking if it returns after the data being sent has
been copied out of the user space, not one which waits for the data to be
received at the other end. A non-blocking send can also be used, but in that
case the compiler has to insert a wait for the send to be over before
overwriting that data.


Earliest placement:

    HPF align (i) with VP(i) :: A, B, C
    B(1:n) = ...
    B(i) → VP(i+1), i = 1:n-1
    C(1:n) = ...
    C(i) → VP(i+1), i = 1:n-1
    A(2:n) = B(1:n-1) + C(1:n-1)

Improved placement:

    HPF align (i) with VP(i) :: A, B, C
    B(1:n) = ...
    C(1:n) = ...
    [B(i), C(i)] → VP(i+1), i = 1:n-1
    A(2:n) = B(1:n-1) + C(1:n-1)



Fig. 6.1. Message combining opportunity missed by earliest placement 



the earliest and the latest safe placements of that communication, which also 
dominate [1] the original reference. In the second step, it marks the set of 
all candidate positions for each communication - these correspond to state- 
ments encountered during the dominator tree traversal from (the basic block 
containing) the latest position to (the basic block containing) the earliest 
marked position of that communication. The third step involves comparing 
the sets of possible communications at successive pairs of statements, and if 
the communication set at one statement is a subset of the communication set 
at the other statement, discarding the smaller set from further consideration. 
This step helps in pruning the search space for communication placement 
without losing any opportunity for reducing the number of messages. The 
fourth step consists of detecting and eliminating any subsumed communica- 
tion, by checking if the ASD corresponding to a communication is a subset of 
the ASD corresponding to another communication. In the fifth step, for each 
communication entry that still appears in multiple communication sets (at 
different statements), the final position is chosen using the following
heuristic: the most constrained communication entry is selected and placed where
it is compatible with (i.e., can be combined with) the largest number of 
other candidate communications. At the end of this step, the entries in the 
communication set at each statement can be partitioned into groups, each 
group consisting of one or more entries which will be combined into a single 
aggregate communication operation. Any flexibility still available in placing 
this aggregate (based on the candidate positions of members of the aggre- 
gate) can be used to push this communication later if reducing contention for 
buffers and cache is more important than overlap benefits (which is true for 
machines like the IBM SP2), or push it earlier if overlapping communication 
with computation is more important. We refer the interested reader to [10] 
for further details. 
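The heuristic of the fifth step can be sketched in Python as follows. This is a simplified greedy reading of the description above, not the algorithm of [10]: "most constrained" is taken to mean "fewest candidate positions", compatibility is supplied as an explicit set of pairs, and the function name `choose_positions` is hypothetical.

```python
def choose_positions(candidates, compatible):
    """Greedy sketch of the final-position heuristic.

    candidates: dict mapping a communication name to its set of candidate
                positions (statement numbers).
    compatible: set of frozenset({a, b}) pairs that may be combined.
    """
    placement = {}
    remaining = dict(candidates)
    while remaining:
        # Most constrained entry: fewest candidate positions (ties broken by name).
        comm = min(remaining, key=lambda c: (len(remaining[c]), c))

        def score(pos):
            # Number of compatible communications that can still share `pos`.
            return sum(
                1
                for other, positions in candidates.items()
                if other != comm
                and pos in positions
                and frozenset((comm, other)) in compatible
                and placement.get(other, pos) == pos
            )

        placement[comm] = max(sorted(remaining.pop(comm)), key=score)
    return placement


# Figure 6.1: B may be communicated after statement 1 or 2, C only after 2.
placement = choose_positions(
    {"B": {1, 2}, "C": {2}},
    {frozenset(("B", "C"))},
)
```

For the example of Figure 6.1, both communications end up at statement 2, so they can be combined into a single aggregate operation.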

The above algorithm is able to exploit both redundancy elimination and 
combining of communication to reduce the number of messages. It explores 
later placements of communication that preserve the benefits of redundancy 
elimination (normally obtained by moving communication earlier), and in the
process, also avoids unnecessary early movement of communication that only
increases contention for communication buffers.



7. Operations on Available Section Descriptors



In this section, we present the algorithms for various operations on the ASDs. 
Each operation is described in terms of further operations on the array section 
descriptors and the mapping descriptors that constitute the ASDs. There is 
an implicit order in each of those computations. The operations are first 
carried out over the array section descriptors, and then over the descriptors 
of the mapping functions applied to the resulting array section. 

The results of these operations cannot always be computed exactly, either
because some part of the operand(s) is unknown at compile-time, or because
the ASDs are not closed under that operation. In that case, the compiler
must appropriately either underestimate or overestimate the result so
that the final optimization is only conservative, not incorrect. In Section
4.1, for each data flow equation, we described whether the result was
underestimated or overestimated. Based on those constraints, we observe that
the results of intersection and union operations are always underestimated
in our framework, while the result of the difference operation may need to
be underestimated or overestimated, depending on the data flow equation. In
our descriptions, the special value $\bot$ is used to represent statically
unknown parameters. This special value $\bot$ is treated as null when the
results are being underestimated. In the case of the difference operation,
$var_1 - var_2$, when the result is to be overestimated, a resulting value
of $\bot$ is interpreted as $var_1$.
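The interpretation of the unknown value can be made concrete with a short Python sketch. The sentinel `UNKNOWN`, the use of frozensets for descriptors, and the function `resolve_difference` are illustrative assumptions, not part of the framework's implementation.

```python
UNKNOWN = object()      # stands for the special statically-unknown value
EMPTY = frozenset()     # stands for the null descriptor


def resolve_difference(result, var1, overestimate):
    """Interpret the result of var1 - var2 conservatively.

    A known result is returned unchanged. An unknown result becomes var1
    when overestimating and null when underestimating.
    """
    if result is not UNKNOWN:
        return result
    return var1 if overestimate else EMPTY


var1 = frozenset({1, 2, 3})
over = resolve_difference(UNKNOWN, var1, overestimate=True)
under = resolve_difference(UNKNOWN, var1, overestimate=False)
```

Both outcomes are safe: the overestimate never drops elements that might remain, and the underestimate never claims elements that might have been removed.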

Intersection Operation. The intersection of two ASDs represents the elements
constituting the common part of their array sections that are mapped to the
same processors. The operation is given by:

$$\langle D_1, M_1 \rangle \cap \langle D_2, M_2 \rangle = \langle (D_1 \cap D_2), (M_1 \cap M_2) \rangle$$

In the above equation, $M_1 \cap M_2$ represents the intersection of each of
the mapping functions $M_1$ and $M_2$ applied to the array region
$(D_1 \cap D_2)$.
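To see why the component-wise rule is sound, the following Python sketch models an ASD as an explicit set of (element, processor) pairs. This enumeration is purely illustrative (the framework works on symbolic descriptors), and the names `expand` and `asd_intersection` are hypothetical.

```python
def expand(region, mapping):
    """All (element, processor) pairs denoted by the descriptor <region, mapping>."""
    return {(e, p) for e in region for p in mapping(e)}


def asd_intersection(asd1, asd2):
    """Intersect the regions, then intersect the mappings on the common region."""
    (d1, m1), (d2, m2) = asd1, asd2
    region = d1 & d2
    return region, (lambda e: m1(e) & m2(e))


d1, m1 = set(range(1, 11)), (lambda i: {i, i + 1})       # A(1:10) on processors i, i+1
d2, m2 = set(range(5, 21)), (lambda i: {i + 1, i + 2})   # A(5:20) on processors i+1, i+2

region, mapping = asd_intersection((d1, m1), (d2, m2))
pairs = expand(region, mapping)
```

In this model the result coincides with the plain set intersection of the two expanded descriptors: elements 5 to 10, each available on processor i+1.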

Union Operation. The union operation presents some difficulty because the 
ASDs are not closed under the union operation. The same is true for data 
descriptors like the DADs and the BRSDs, that have been used in practi- 
cal optimizing compilers [4,22]. As we explained earlier, in the context of 
our framework, any errors introduced during this approximation should be 
towards underestimating the extent of the descriptor. 

One way to minimize the loss of information in computing
$\langle D_1, M_1 \rangle \cup \langle D_2, M_2 \rangle$ is to maintain a
list consisting of (1) $\langle D_1, M_1 \rangle$, (2)
$\langle D_2, M_2 \rangle$, (3)
$\langle (D_1 \cup D_2), (M_1 \cap M_2) \rangle$, and (4)
$\langle (D_1 \cap D_2), (M_1 \cup M_2) \rangle$. Subsequently, any
operations involving the descriptor would have to be carried out over all the
elements in that list. The items (3) and (4) are included in the list because
they potentially provide more useful information than just (1) and (2).


(a) Item (1): D1 = A(1:100, 3),  M1 = [2,1], F1(j) = F2(j) = j

(b) Item (2): D2 = A(1:100, 4),  M2 = [2,1], F1(j) = F2(j) = j

(c) Item (3): D1 ∪ D2 = A(1:100, 3:4),  M1 ∩ M2 = [2,1], F1(j) = F2(j) = j

Fig. 7.1. Example of ASD union operation 



Example. Let $\langle D_1, M_1 \rangle = \langle A(1:100, 3), \langle [2,1], F \rangle \rangle$,
and let $\langle D_2, M_2 \rangle = \langle A(1:100, 4), \langle [2,1], F \rangle \rangle$,
where $F_1(j) = F_2(j) = j$. Figure 7.1(a) shows $\langle D_1, M_1 \rangle$
and Figure 7.1(b) shows $\langle D_2, M_2 \rangle$. Consider the subset test:
$\langle A(7:8, 3:4), \langle [2,1], F \rangle \rangle \subseteq \langle D_1, M_1 \rangle \cup \langle D_2, M_2 \rangle$.
If the union is represented as the list of ASDs 7.1(a) and 7.1(b) only, this
subset test will fail, which is inaccurate. The ASD for item (3),
$\langle (D_1 \cup D_2), (M_1 \cap M_2) \rangle$, is shown in
Figure 7.1(c). The subset test succeeds for 7.1(c).
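The example can be replayed with a small Python sketch. Since all three descriptors share the same mapping, only the array regions are compared; regions are modelled as tuples of inclusive unit-stride (low, high) ranges, and the helper `is_subset` is a hypothetical name.

```python
def is_subset(inner, outer):
    """Dimension-wise containment of unit-stride regions given as (low, high) pairs."""
    return all(ol <= il and ih <= oh for (il, ih), (ol, oh) in zip(inner, outer))


d1 = ((1, 100), (3, 3))        # A(1:100, 3)
d2 = ((1, 100), (4, 4))        # A(1:100, 4)
merged = ((1, 100), (3, 4))    # A(1:100, 3:4), region of item (3)
query = ((7, 8), (3, 4))       # A(7:8, 3:4)

found_in_list = any(is_subset(query, d) for d in (d1, d2))
found_in_merged = is_subset(query, merged)
```

The query region straddles both columns, so it is contained in neither list member individually, but it is contained in the merged region.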
However, this solution is too expensive, as the size of the list grows
exponentially in the number of original descriptors to be unioned (the growth
would be linear if only items (1) and (2) were included in the list). There
are a number of optimizations that can be done to exclude redundant terms
from the above list:

– If $D_1 \subseteq D_2$, the first term can be dropped from the list, since
  $\langle D_1, M_1 \rangle$ is subsumed by
  $\langle (D_1 \cap D_2), (M_1 \cup M_2) \rangle$. Similarly, appropriate
  terms can be dropped when $D_2 \subseteq D_1$, $M_1 \subseteq M_2$, or
  $M_2 \subseteq M_1$.

– If $D_1 = D_2$, only the fourth term needs to be retained, since
  $\langle D_1, M_1 \cup M_2 \rangle$ subsumes all other terms. If
  $M_1 = M_2$, the third term, which effectively evaluates to
  $\langle D_1 \cup D_2, M_1 \rangle$, subsumes all other terms, which may
  hence be dropped.

In addition to these optimizations, the compiler can use heuristics like drop- 
ping terms associated with the smaller array regions to ensure that the size 
of such lists is bounded by a constant. Further discussion of these heuristics 
is beyond the scope of this article. In our prototype implementation, for sim- 
plicity, the compiler represents the result of union operation as a list of the 
individual ASDs unless it can infer that one ASD is a subset of the other. 
This ensures, at the expense of some accuracy, that the size of the list is
linearly bounded by the number of communications.

Difference Operation. The difference operation causes a part of the data
associated with the first ASD to be invalidated at the processors to which
that part is mapped under the second ASD.

$$\langle D_1, M_1 \rangle - \langle D_2, M_2 \rangle = \langle (D_1 - D_2), M_1 \rangle \cup \langle D_1, (M_1 - M_2) \rangle$$

$$= \mathrm{list}(\langle (D_1 - D_2), M_1 \rangle, \langle D_1, (M_1 - M_2) \rangle)$$

The result represents (i) the elements of the reduced array region
$(D_1 - D_2)$ that are still available at processors given by the original
mapping function $M_1$, and (ii) the elements of the original region $D_1$
that are still available at the processors to which those elements are not
mapped under $M_2$.

We observe that if either $D_1$ and $D_2$ or $M_1$ and $M_2$ are mutually
disjoint, the difference operation gives back the original operand
$\langle D_1, M_1 \rangle$. Also, the first term in the above list evaluates
to null if $D_1 \subseteq D_2$, and the second term evaluates to null if
$M_1 \subseteq M_2$. The latter case always holds for the difference
operations involving the KILL information in our data flow equations,
because the ASD for killed data always has the universal mapping function U,
signifying that data is killed on all processors.
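The two-term form of the difference can be checked on a small model in which an ASD is expanded into explicit (element, processor) pairs. The following Python sketch is illustrative only; the names `expand` and `asd_difference`, the processor range, and the sample mappings are assumptions.

```python
def expand(region, mapping):
    """All (element, processor) pairs denoted by the descriptor <region, mapping>."""
    return {(e, p) for e in region for p in mapping(e)}


def asd_difference(asd1, asd2):
    """Return the two terms <(D1 - D2), M1> and <D1, (M1 - M2)>."""
    (d1, m1), (d2, m2) = asd1, asd2
    first = (d1 - d2, m1)
    second = (d1, lambda e: m1(e) - m2(e))
    return first, second


ALL_PROCS = set(range(0, 25))                      # assumed processor set
d1, m1 = set(range(1, 11)), (lambda i: {i, i + 1})
d2, m2 = set(range(4, 8)), (lambda i: {i + 1})

first, second = asd_difference((d1, m1), (d2, m2))
result = expand(*first) | expand(*second)

# KILL case: the second operand carries the universal mapping, so the
# second term is empty and only <(D1 - D2), M1> survives.
k_first, k_second = asd_difference((d1, m1), (d2, lambda i: ALL_PROCS))
```

In this model the union of the two terms equals the exact set difference of the expanded descriptors, and with a universal second mapping the second term vanishes.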

7.1 Operations on Bounded Regular Section Descriptors

Let $D_1$ and $D_2$ be two BRSDs that correspond to array sections
$A(S_1^1, \ldots, S_1^n)$ and $A(S_2^1, \ldots, S_2^n)$ respectively, where
each $S_1^i$ and $S_2^i$ term is a range represented as a triple. We now
describe the computations corresponding to different operations on $D_1$ and
$D_2$.

Intersection Operation. The result is obtained by carrying out individual
intersection operations on the ranges corresponding to each array dimension.

$$D_1 \cap D_2 = A(S_1^1 \cap S_2^1, \ldots, S_1^n \cap S_2^n)$$

The formula for computing the exact intersection of two ranges, $S_1^i$ and
$S_2^i$, expressed as triples, is given in [28]. The result of the
intersection operation is another range represented as a triple. If either
of the two ranges is equal to $\bot$, the result also takes the value $\bot$.



function union-range(S1, S2)
    if (S1 = ⊥) return S2
    if (S2 = ⊥) return S1
    Let S1 = (l1 : u1 : s1), S2 = (l2 : u2 : s2)
    if (s1 mod s2 = 0) & ((l2 - l1) mod s1 = 0)
        s = s1
    else if (s2 mod s1 = 0) & ((l2 - l1) mod s2 = 0)
        s = s2
    else return list(S1, S2)
    if (l1 ≤ l2) & (u1 ≥ l2)
        return (l1 : max(u1, u2) : s)
    else if (l2 ≤ l1) & (u2 ≥ l1)
        return (l2 : max(u1, u2) : s)
    return list(S1, S2)
end union-range

Fig. 7.2. Algorithm to compute union of ranges 
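The algorithm of Figure 7.2 can be transcribed into Python as the sketch below. Ranges are taken to be (lower, upper, stride) triples with an inclusive upper bound, `None` stands for the unknown value, and the helper `elems` is added only to check results; these representation choices are assumptions.

```python
UNKNOWN = None  # stands for a statically unknown range


def union_range(r1, r2):
    """Underestimating union of two (lower, upper, stride) ranges.

    Returns one merged triple when the ranges can be combined safely,
    otherwise the list [r1, r2].
    """
    if r1 is UNKNOWN:
        return r2
    if r2 is UNKNOWN:
        return r1
    (l1, u1, s1), (l2, u2, s2) = r1, r2
    # The merged stride is the coarser one, provided the ranges are aligned.
    if s1 % s2 == 0 and (l2 - l1) % s1 == 0:
        s = s1
    elif s2 % s1 == 0 and (l2 - l1) % s2 == 0:
        s = s2
    else:
        return [r1, r2]
    # Merge only when one range starts inside the other.
    if l1 <= l2 <= u1:
        return (l1, max(u1, u2), s)
    if l2 <= l1 <= u2:
        return (l2, max(u1, u2), s)
    return [r1, r2]


def elems(r):
    """Explicit element set of a triple, used only for checking."""
    l, u, s = r
    return set(range(l, u + 1, s))


merged = union_range((1, 10, 1), (5, 20, 1))
coarse = union_range((1, 9, 2), (1, 20, 1))
apart = union_range((1, 10, 2), (2, 10, 2))
```

The second call shows the underestimation at work: the merged triple keeps the coarser stride 2, so it covers only part of the true union but never includes an element outside it.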



Union Operation. The BRSDs are not closed under the union operation. The
algorithm described in [22] to compute an approximate union potentially
overestimates the region corresponding to the union. We need a different
algorithm, one that underestimates the array region while being conservative.
In the special case when the array regions identified by $D_1$ and $D_2$
differ only in one dimension, say, $i$, the union operation is given by:

$$D_1 \cup D_2 = A(S_1^1, \ldots, S_1^{i-1}, S_1^i \cup S_2^i, S_1^{i+1}, \ldots, S_1^n)$$

In the most general case when the regions differ in each dimension, an
exhaustive list of regions corresponding to $D_1 \cup D_2$ would include
(1) $A(S_1^1, \ldots, S_1^n)$, (2) $A(S_2^1, \ldots, S_2^n)$, and the
following $n$ terms:
$A(S_1^1 \cup S_2^1, S_1^2 \cap S_2^2, \ldots, S_1^n \cap S_2^n)$, $\ldots$,
$A(S_1^1 \cap S_2^1, \ldots, S_1^{n-1} \cap S_2^{n-1}, S_1^n \cup S_2^n)$.
Once again, heuristics are needed to keep the lists bounded. Figure 7.2
describes an algorithm for computing an approximate union of two ranges.

Difference Operation. Conceptually, the result is a list of $n$ regions, each
region corresponding to the difference taken along one array dimension.

$$D_1 - D_2 = \mathrm{list}(A(S_1^1 - S_2^1, S_1^2, \ldots, S_1^n), \ldots, A(S_1^1, \ldots, S_1^{n-1}, S_1^n - S_2^n))$$

When the regions corresponding to $D_1$ and $D_2$ differ only in one
dimension, all except for one of the terms in the above expression evaluate
to null. When there is more than one non-null term, the result may be
represented as a list, and heuristics used (as discussed earlier) to keep
such lists bounded in length.

Again, the formula for computing the difference, $S^i = S_1^i - S_2^i$, when
both $S_1^i$ and $S_2^i$ are expressed as triples, is given in [28]. If
either $S_1^i = \bot$ or $S_2^i = \bot$, then $S^i = \bot$.
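The per-dimension form of the difference can be checked with a small Python sketch in which each dimension of a region is an explicit set of indices. This enumerated model and the names `box` and `brsd_difference` are assumptions made for illustration; the actual computation on triples is the one given in [28].

```python
from itertools import product


def box(dims):
    """Explicit element set of a region given as one index set per dimension."""
    return set(product(*dims))


def brsd_difference(d1, d2):
    """List of regions: term k takes the difference along dimension k only."""
    terms = []
    for k in range(len(d1)):
        dims = list(d1)
        dims[k] = d1[k] - d2[k]
        if dims[k]:                      # drop terms that evaluate to null
            terms.append(tuple(dims))
    return terms


d1 = (set(range(1, 11)), set(range(1, 11)))        # A(1:10, 1:10)
same_cols = (set(range(4, 8)), set(range(1, 11)))  # differs in dimension 1 only
both = (set(range(4, 8)), set(range(3, 6)))        # differs in both dimensions

one_term = brsd_difference(d1, same_cols)
two_terms = brsd_difference(d1, both)
covered = set().union(*(box(t) for t in two_terms))
```

When the operands differ in a single dimension only one term is non-null; otherwise the union of the listed regions reproduces the exact difference in this model.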



7.2 Operations on Mapping Function Descriptors 



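Consider two mapping function descriptors, $M_1 = \langle P_1, F_1 \rangle$
and $M_2 = \langle P_2, F_2 \rangle$, associated with the same array section
$A(S)$. The intersection, union, and difference operations are defined as:

$$M_1 \cap M_2 = \begin{cases} \langle P_1, F_1 \cap F_2 \rangle & \text{if } P_1 = P_2 \\ \bot & \text{if } P_1 \neq P_2 \end{cases}$$

$$M_1 \cup M_2 = \begin{cases} \langle P_1, F_1 \cup F_2 \rangle & \text{if } P_1 = P_2 \\ \mathrm{list}(M_1, M_2) & \text{if } P_1 \neq P_2 \end{cases}$$

$$M_1 - M_2 = \begin{cases} \langle P_1, F_1 - F_2 \rangle & \text{if } P_1 = P_2 \\ \bot & \text{if } P_1 \neq P_2 \end{cases}$$

Let $F_1 = [F_1^1, \ldots, F_1^n]$, and let $F_2 = [F_2^1, \ldots, F_2^n]$.
The computations of various operations between $F_1$ and $F_2$ can be
described at two levels: (i) computations of $F_1 \cap F_2$, $F_1 \cup F_2$,
and $F_1 - F_2$ in terms of further operations over $F_1^i$ and $F_2^i$,
$1 \leq i \leq n$, and (ii) computation of $F_1^i \cap F_2^i$,
$F_1^i \cup F_2^i$, and $F_1^i - F_2^i$.

The computations of the first type are identical to those described for the
BRSDs. For example, the intersection operation is given by:

$$F_1 \cap F_2 = [F_1^1 \cap F_2^1, \ldots, F_1^n \cap F_2^n]$$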

We now describe the intersection, union and difference operations on the
mapping functions of individual array dimensions. As mentioned in Section
2, depending on whether $P^i$ represents a true array dimension or a
"missing" dimension, $F_1^i$ is either a function of the type
$F_1^i(j) = (c_1 \cdot j + l_1 : c_1 \cdot j + u_1 : s_1)$, or simply a
constant range, $F_1^i = (l_1 : u_1 : s_1)$. We need to describe only the
results for the former case, since the latter can be viewed as a special
case of the former, with $c_1$ set to zero. Thus, in the remainder of this
section, we shall regard $F_1^i$ and $F_2^i$ as functions of the form
$F_1^i(j) = (c_1 \cdot j + l_1 : c_1 \cdot j + u_1 : s_1)$ and
$F_2^i(j) = (c_2 \cdot j + l_2 : c_2 \cdot j + u_2 : s_2)$. The data domain
for these functions is dimension $P^i$ of the array section, referred to as
$S^{P^i}$. Each of the operations between $F_1^i$ and $F_2^i$ has two cases:

Case 1: $c_1 = c_2$.

All of the operations can be collectively described as follows. Let $\theta$
denote the intersection, union, or the difference operator, and let
$(l : u : s) = (l_1 : u_1 : s_1) \ \theta \ (l_2 : u_2 : s_2)$ as described
in the previous subsection. The result is given by:

$$(F_1^i \ \theta \ F_2^i)(j) = (c_1 \cdot j + l : c_1 \cdot j + u : s)$$

Case 2: $c_1 \neq c_2$.

Intersection Operation. In this case, we check if one of the mapping
functions completely covers the other function over the data domain, and
otherwise return $\bot$. The high-level algorithm is:

    if $F_1^i(j) \subseteq F_2^i(j), \ \forall j \in S^{P^i}$
        return $F_1^i$
    else if $F_2^i(j) \subseteq F_1^i(j), \ \forall j \in S^{P^i}$
        return $F_2^i$
    else return $\bot$

Let $S^{P^i} = (low : high : step)$. We now describe the conditions for
checking if $F_1^i$ is covered by $F_2^i$. For $j \in (low : high : step)$,

$$F_1^i(j) \subseteq F_2^i(j) \iff (c_1 \cdot j + l_1 : c_1 \cdot j + u_1 : s_1) \subseteq (c_2 \cdot j + l_2 : c_2 \cdot j + u_2 : s_2)$$

A set of sufficient conditions for the above relationship to hold is:
(i) $c_1 \cdot j + l_1 \geq c_2 \cdot j + l_2$, $low \leq j \leq high$,
(ii) $c_1 \cdot j + u_1 \leq c_2 \cdot j + u_2$, $low \leq j \leq high$,
(iii) $(c_1 - c_2) \bmod s_2 = 0$, (iv) $(l_1 - l_2) \bmod s_2 = 0$, and
(v) $s_1 \bmod s_2 = 0$. In the special case when $s_1 = s_2 = 1$, the last
three conditions are satisfied trivially, and conditions (i) and (ii) are
both necessary and sufficient. The conditions (i) and (ii) can be further
simplified to the following, which can be tested efficiently:

    (i)  $c_1 \cdot low + l_1 \geq c_2 \cdot low + l_2$,
    (ii) $c_1 \cdot high + u_1 \leq c_2 \cdot high + u_2$,   if $c_1 > c_2$

    (i)  $c_1 \cdot high + l_1 \geq c_2 \cdot high + l_2$,
    (ii) $c_1 \cdot low + u_1 \leq c_2 \cdot low + u_2$,     if $c_1 < c_2$

The conditions for checking if $F_2^i(j) \subseteq F_1^i(j)$,
$\forall j \in S^{P^i}$, can be derived in a similar manner.
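The simplified endpoint conditions can be exercised with a short Python sketch for the unit-stride case ($s_1 = s_2 = 1$), where conditions (i) and (ii) are necessary and sufficient. Mapping functions are represented as (c, l, u) triples, and the names `covers` and `covers_brute` are hypothetical; the brute-force version is included only as a cross-check.

```python
def covers(f1, f2, low, high):
    """True if F1(j) is contained in F2(j) for all low <= j <= high (unit strides).

    Only the two endpoints of the data domain are examined, because each
    bound difference is linear in j.
    """
    (c1, l1, u1), (c2, l2, u2) = f1, f2
    if c1 >= c2:
        lower_ok = c1 * low + l1 >= c2 * low + l2
        upper_ok = c1 * high + u1 <= c2 * high + u2
    else:
        lower_ok = c1 * high + l1 >= c2 * high + l2
        upper_ok = c1 * low + u1 <= c2 * low + u2
    return lower_ok and upper_ok


def covers_brute(f1, f2, low, high):
    """Same test, checking conditions (i) and (ii) at every j in the domain."""
    (c1, l1, u1), (c2, l2, u2) = f1, f2
    return all(
        c1 * j + l1 >= c2 * j + l2 and c1 * j + u1 <= c2 * j + u2
        for j in range(low, high + 1)
    )


inside = covers((2, 0, 1), (1, 0, 20), low=1, high=10)
outside = covers((2, 0, 1), (1, 0, 5), low=1, high=10)
```

Checking two endpoints instead of every index keeps the test constant-time per dimension, which matters because it is applied repeatedly during data flow analysis.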

Union Operation. We check if one of the mapping functions completely covers
the other, and otherwise return a list, as shown below:

    if $F_1^i(j) \subseteq F_2^i(j), \ \forall j \in S^{P^i}$
        return $F_2^i$
    else if $F_2^i(j) \subseteq F_1^i(j), \ \forall j \in S^{P^i}$
        return $F_1^i$
    else return list($F_1^i$, $F_2^i$)

Difference Operation. We check for the special cases of the first mapping
function being covered by the second mapping function, or the two being
mutually disjoint, and return $\bot$ otherwise. As mentioned earlier, $\bot$
is interpreted as $\emptyset$ or $F_1^i$, depending on whether the result is
to be underestimated or overestimated.

    if $F_1^i \cap F_2^i = F_1^i$
        return $\emptyset$
    else if $F_1^i \cap F_2^i = \emptyset$
        return $F_1^i$
    else return $\bot$



8. Preliminary Implementation and Results

We have done a prototype implementation of our data flow framework as 
part of the pHPF compiler for HPF [19]. Our implementation is meant to 
serve as a platform to investigate the potential performance benefits from the 
data flow analysis, and currently represents a very simplified version of the 
analysis presented in this chapter. Currently, our compiler only eliminates
fully redundant communication; it does not try to reduce the amount of
data communicated or hide the latency of communication. Furthermore, for
the sake of simplicity of implementation, the compiler does not currently 
move communication across different loop nests. However, since the analysis is 
performed on the program before it is transformed with loop distribution for 
vectorizing communication and eliminating extra guard statements [19,23], 
our results include some potential benefits of a more global analysis as well. 



HPF Align A(i,j), B(i,j) with VPROCS(i,j)
HPF Distribute VPROCS(*, block)
do j = 3, n
  do i = 3, n
    A(i,j) = ... B(i,j-1) + B(i,j-2) ...
  end do
end do

Fig. 8.1. Extension of subset-test for nearest-neighbor communication 



In our implementation, we have incorporated an extension to make our
analysis exploit information about the distribution of arrays on physical
processors in special cases, not merely the alignment information. Consider
the program shown in Figure 8.1. On the basis of alignment information, we
would view the communications $B(i, j-1) \rightarrow VPROCS(i, j)$ and
$B(i, j-2) \rightarrow VPROCS(i, j)$ as separate. However, taking into
account the block distribution of the second array dimension, we can
recognize the communication for $B(i, j-2)$ as subsuming the communication
for $B(i, j-1)$, as the former involves two boundary columns and the latter
involves one boundary column being communicated. We have implemented this by
extending the test for one communication being a subset of another, for
nearest-neighbor communication.
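The subsumption under a block distribution can be illustrated with a small Python sketch that enumerates the boundary columns each reference of Figure 8.1 needs from a neighbouring processor. The block size, array extent, and the helper `nonlocal_columns` are hypothetical values chosen for illustration; the compiler performs this test symbolically.

```python
def nonlocal_columns(p, block, n, offset):
    """Columns that processor p must receive for a reference B(i, j - offset).

    Columns are block-distributed: processor p owns columns
    p*block + 1 .. (p+1)*block, and the loop runs over j = 3 .. n.
    """
    lo, hi = p * block + 1, min((p + 1) * block, n)
    owned = set(range(lo, hi + 1))
    needed = {j - offset for j in owned if j >= 3}
    return needed - owned


block, n = 4, 16
one_back = nonlocal_columns(1, block, n, offset=1)   # for B(i, j-1)
two_back = nonlocal_columns(1, block, n, offset=2)   # for B(i, j-2)
subsumed_everywhere = all(
    nonlocal_columns(p, block, n, 1) <= nonlocal_columns(p, block, n, 2)
    for p in range(n // block)
)
```

On every processor the column needed for `B(i, j-1)` is among the two columns needed for `B(i, j-2)`, so a single communication of the two boundary columns serves both references.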

We now describe the results of our experiments performed on five pro- 
grams, which are part of an HPF benchmark suite developed by Applied 
Parallel Research, Inc. The first program, tomcatv (originally from SPEC 
benchmarks), does mesh generation with Thompson’s solver. The second pro- 
gram, x42, is an explicit modeling system using fourth order differencing. The 
third program, tred2, is from the EISPACK library. It reduces a real sym- 
metric matrix to a symmetric tridiagonal matrix, using and accumulating 
orthogonal similarity transformations. The program grid performs a 9-point 
stencil computation followed by global reductions. The last program, baro, 
performs computations for a shallow water atmospheric model. 



Program                    #Refs with Comm.               % Refs with
                       Original   Redundancy Elim.      Redundant Comm.

tomcatv (*, block)        47            35                   25.5
grid (block, block)       15            11                   26.7
x42 (block, block)        33            17                   48.5
tred2 (block, *)          22            19                   13.6
baro (*, block)            3             3                    0
  comp                    47            34                   27.7
  cmslow                  44            21                   52.3
  intbal                   3             1                   66.7
  graph 1                  0             0                   N/A



Table 8.1. Results of optimization to eliminate redundant communication 



Table 8.1 shows the static counts of the number of references to array and 
scalar variables in the program which required interprocessor communication, 
with and without the optimization for redundancy elimination. The first col- 
umn describes for each program how the main HPF template was distributed 
on processors, since that affects the number of communications needed. Each 
of the programs was compiled with the number of physical processors left 
unspecified at compile-time. As can be seen from the table, there is an ap- 
preciable reduction in the number of references that need communication. 
Amongst the subroutines which had more than ten references needing com- 
munication, we observed a range of 13.6% to 52.3% of those communications 
(in terms of the static counts of references) as being completely redundant.
The results for the program baro have been presented in terms of individual
results for each subroutine. It was particularly encouraging to note that the
subroutine that shows the best improvement, cmslow, is the most frequently
executed subroutine and accounts for the maximum amount of time spent
in the program. The improvements for tred2 were modest; the only
communications eliminated were those for scalars. However, a hand-analysis of
the program showed opportunities for more global communication optimizations
captured by our framework, which were not implemented in this version of
the compiler.

We now present some performance results obtained on the IBM SP-1 machine.
These programs were compiled using two versions of the HPF compiler,
one which does not perform any data flow analysis to eliminate redundant 
communication, and the other one which does. These timings do not include 
any time spent on I/O, since that had been commented out from the main 
computation in those programs. Tables 8.2 and 8.3 show the performance of 
the programs tomcatv and grid for various values of the number of proces- 
sors (p) and the data size (n). The tables give the execution times without 
applying the redundant message elimination (RME) optimization and after 
applying this optimization. 



   n           P = 1                P = 4                P = 8                P = 16
           Orig       RME       Orig       RME       Orig       RME       Orig       RME

   65      0.829     0.825      0.866     0.426      0.819     0.366      0.838     0.379
  129      3.321     3.314      1.937     1.183      1.575     0.815      1.469     0.671
  257     13.175    13.257      5.314     3.923      3.714     2.305      2.990     1.560
  513     55.977    55.902     17.973    15.304     10.765     8.085      7.361     4.732
 1025    244.286   245.509     63.028    57.905     36.000    30.476     21.504    16.263

Table 8.2. Execution times (in seconds) of tomcatv on IBM SP-1



Table 8.2 shows noticeable improvements in the performance of tomcatv 
due to redundant message elimination. The performance improvement on 
16 processors varies from 25% to 55% for different data sizes ranging from 
n = 65 to n = 1025. The relative gain in performance is lower for larger data 
sizes and for programs run on fewer processors because of computation time 
dominating the communication time. However, even for larger data sizes, the 
performance improvement on 16 processors is quite significant. For smaller 
data sizes, the improvement is much more substantial. It is interesting to note 
that redundant message elimination enables the compiler to obtain speedups 
for a data size as small as n = 65, whereas there were no speedups obtained 
for that data size without this optimization. These results confirm the ef- 
fectiveness of redundant message elimination in reducing the communication 
costs of this benchmark program. 
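The quoted range of improvements can be recomputed from the 16-processor timings with a few lines of Python. The sketch takes the last Orig/RME pair in each row of Table 8.2 as the 16-processor times and measures improvement as the relative reduction in execution time; the dictionary and function names are hypothetical.

```python
# (Orig, RME) execution times in seconds for tomcatv on 16 processors.
tomcatv_p16 = {
    65: (0.838, 0.379),
    129: (1.469, 0.671),
    257: (2.990, 1.560),
    513: (7.361, 4.732),
    1025: (21.504, 16.263),
}


def improvement(orig, rme):
    """Relative reduction in execution time, in percent."""
    return 100.0 * (orig - rme) / orig


gains = {n: improvement(orig, rme) for n, (orig, rme) in tomcatv_p16.items()}
```

The gain falls steadily as n grows, from roughly 55% at n = 65 to roughly 24% at n = 1025, consistent with computation time increasingly dominating communication time.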
   n           P = 1                P = 4                P = 8                P = 16
           Orig       RME       Orig       RME       Orig       RME       Orig       RME

   64      0.927     0.927      0.281     0.270      0.156     0.143      0.111     0.093
  128      3.702     3.699      1.002     0.982      0.496     0.480      0.304     0.285
  256     14.775    14.772      3.818     3.789      1.762     1.730      1.052     1.015
  512     59.061    59.070     14.972    14.927      6.816     6.763      3.906     3.859
 1024    236.227   236.192     59.459    59.351     26.665    26.576     15.163    15.072



Table 8.3. Execution times (in seconds) of grid on IBM SP-1 



The performance improvement for grid is only modest. This is because 
the communication time for this benchmark program is very small compared 
to the overall execution time, leaving relatively little room for improvement. 
Due to this low communication time, we observe good speedups even with- 
out redundant message elimination. However, even for this program, when 
n = 64, where the communication time is relatively higher, we notice a per- 
formance improvement of 3-16% after redundant message elimination. Thus, 
our optimization does reduce the communication cost, but makes a significant 
difference to overall performance only if communication cost itself is high. 



9. Related Work 

9.1 Global Communication Optimizations 

Many other researchers have used data-flow analysis to optimize communica- 
tion. Granston and Veidenbaum [16] use data-flow analysis to detect redun- 
dant accesses to global memory in a hierarchical, shared-memory machine. 
However, they do not explicitly represent information about the availability 
of data on processors. Instead, they rely on simplistic assumptions about 
scheduling of parallel loops, which are often not applicable. 

Amarasinghe and Lam [3] use the last write tree framework to perform 
optimizations like eliminating redundant messages. Their framework does 
not handle general conditional statements, and they do not eliminate redun- 
dant communication due to different references in arbitrary statements (for 
instance, statements appearing in different loop nests). 

Gong et al. [15] describe a data-flow procedure that unifies optimiza- 
tions like vectorizing communication, removing redundant communication, 
and moving communication earlier to hide latency. They only handle pro- 
grams with singly nested loops and one-dimensional arrays, and with very 
simple subscripts. 

Von Hanxleden et al. describe the Give-N-Take framework [40,41] that 
they use for generating communication in the presence of indirection arrays. 
Their framework is based on the producer-consumer concept, and breaks 
communication into sends and receives, which are placed separately in a
balanced manner. By using an eager solution for sends and a lazy solution for
receives, they obtain maximal separation between sends and receives, which 
is used to overlap communication with computation. Their work focuses on 
irregular subscripts, and therefore does not attempt to obtain more precise 
information about array sections. 

Kennedy and Nedeljkovic [26] extend the Give-N-Take framework to ex- 
ploit information about array sections. They use bit vectors to represent 
array sections for the purpose of data flow analysis. This simplifies data flow 
analysis, in which no set operations need to be performed on array section 
descriptors, at the expense of precision in representing data to be communi- 
cated. 

Kennedy and Sethi [27] present a framework, based on the lazy code motion
technique [24], which maximizes latency hiding by determining the earliest
placement of sends and the latest balanced placement of the corresponding
receives. They present techniques to constrain the placement further such
that the total size of buffers used to hold non-local data does not exceed a 
fixed limit. 

9.2 Data Flow Analysis and Data Descriptors 

Suppression of partially redundant code is a powerful code optimization and 
has found its way into a number of commercial compilers. Morel and Ren- 
voise [31] first proposed a bidirectional bit-vector algorithm for the suppres- 
sion of partial redundancies. The complexity of bidirectional problems for 
bit-vector representations of data flow information was addressed by later 
papers [12,13,24,25]. This work applies the techniques from [31] and [13] 
for eliminating partially redundant communication. We extend the previous 
result on the decomposition of this bidirectional problem into efficient unidi- 
rectional problems by proving it in the context of an approximate data flow 
representation, namely the ASDs. 

Interval analysis, introduced by Allen and Cocke [2], has been used to 
solve several data flow problems. The work by Gross and Steenkiste [17] was 
the first to extend interval analysis to handle array sections. Our data flow 
procedure refines the algorithms described in [17] by using information about 
loop-carried data dependences while summarizing data flow information for 
intervals. In addition, we apply data flow analysis to ASDs that represent
both array section information and information about the processor elements 
on which the array elements are available. 

We have used ideas from well-known representations of array sections used 
in other contexts [4,8,22] for developing a representation of communication in 
this work. In particular, we use the BRSD proposed by Havlak and Kennedy 
[22] to represent, in our framework, the data involved in communication. The 
concept of a mapping function descriptor that we have introduced represents 
a crucial extension to the notion of data descriptor. It enables representation 
of communication and of the data made available at various processors by
prior communications. That in turn allows data flow analysis to be used for
powerful communication optimizations. 



10. Conclusions 

We have presented a data-flow framework for reducing communication costs 
in a program. This framework provides a unified algorithm for performing a 
number of optimizations that eliminate redundant communication and that 
reduce the volume of communication, by reducing both the data and the 
number of processors involved. The algorithm also determines the earliest 
point at which communication can legally be moved, without introducing 
extra communication. That can help hide the latency of communication. This 
algorithm is quite general: it handles control flow and performs optimizations
across loop nests. It also does not depend on a detailed knowledge of the 
explicit communication representation. 

An important feature of our approach is that the analysis is performed 
at the granularity of sections of arrays and processors, which considerably
enhances the scope of optimizations based on eliminating partial redundancies.
We prove that in the context of an ASD representation also, the bidirectional 
problem of determining placement can be decomposed into a backward prob- 
lem followed by a forward correction. This ensures the practicality of the 
analysis by making it efficient. The preliminary results from a simplified im- 
plementation of this framework show significant performance improvements, 
and confirm the effectiveness of the optimization to eliminate redundant com- 
munication. 

In the future, we plan to conduct more extensive experiments, to study 
the performance impact of other optimizations captured by our framework, 
like reducing the volume of communication and hiding the latency of com- 
munication. This will require further examination of issues like management 
of buffers containing non-local data from other processors. Future work will 
also involve extending the data flow framework to perform interprocedural 
optimizations. There is also scope for integrating the concepts of ownership 
and availability of data, and developing algorithms for additional optimiza- 
tions that exploit the fact that processors other than the owners can also 
send values to processors that need them. 

Acknowledgement. The author would like to thank Soumen Chakrabarti, Jong- 
Deok Choi, Edith Schonberg, and Harini Srinivasan for their help with various
aspects of this work.

Summary. Communication latency is a key parameter that affects the
performance of distributed-memory multiprocessors. Instruction-level
multithreading attempts to tolerate latency by overlapping communication with
computation. This chapter explicates the multithreading capabilities of the
EM-X distributed-memory multiprocessor through empirical studies. The EM-X
provides hardware support for dynamic function spawning and instruction-level
multithreading. The support includes a by-passing mechanism for direct remote
reads and writes, hardware FIFO thread scheduling, and dedicated instructions
for generating fixed-sized communication packets based on one-sided
communication. Two problems, bitonic sorting and the Fast Fourier Transform,
are selected for experiments. Parameters that characterize the performance of
multithreading are investigated, including the number of threads, the number
of thread switches, the run length, and the number of remote reads.
Experimental results indicate that the best communication performance occurs
when the number of threads is two to four. Using more than eight threads is
found to be inefficient and adversely affects the overall performance. FFT
yielded over 95% overlapping due to a large amount of computation and
communication parallelism across threads. Even in the absence of thread
computation parallelism, multithreading helps overlap over 35% of the
communication time for bitonic sorting.



1. Introduction 

Distributed-memory multiprocessors are regarded as a viable, scalable, and
economical architecture for building large parallel machines to meet the
ever-increasing demand for high performance computing. In this class of
machine architecture, it is deemed relatively simple to increase a machine’s
capability by adding more processors to the system incrementally as required.
There are various research prototypes as well as commercial machines built on
this distributed-memory/message-passing paradigm, such as Cray T3E [18],
ETL EM-X [10, 11], Intel and IBM ASCIs [1], IBM SP-2 [3], Tera MTA [4], etc.

In a distributed-memory machine, data must be distributed across processors
so that major data structures are neither overlapped nor replicated. Typical
distributed-memory machines incur considerable latency, ranging approximately
from a few to tens of microseconds for a single remote read operation. The
gap between processor cycle time and remote memory access time widens as
processor technology improves and aggressively exploits instruction-level
parallelism. The message-passing machine SP-2 takes approximately 40
microseconds to read data allocated to remote processors. Considering that
the microprocessors are running at over 66.5 MHz (15 nanoseconds cycle time)
for the SP-2 590 model, the loss due to a remote read operation is enormous:
a single remote read operation costs 40 microseconds / 15 nanoseconds, or
about 2667 cycles.
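
The arithmetic behind this figure can be checked with a few lines of Python. This is only an illustrative sketch: the function name `remote_read_cycles` is hypothetical, and the two constants are the SP-2 values quoted in the paragraph above.

```python
def remote_read_cycles(latency_s: float, cycle_time_s: float) -> int:
    """Number of processor cycles that elapse during one remote read."""
    return round(latency_s / cycle_time_s)


latency = 40e-6      # 40 microseconds per remote read (SP-2 figure from the text)
cycle_time = 15e-9   # 15 nanoseconds per processor cycle (SP-2 590 figure)

print(remote_read_cycles(latency, cycle_time))  # prints 2667
```

Every cycle counted here is a cycle in which a single-threaded processor has nothing to do, which is the loss that multithreading tries to recover.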

Various approaches have been developed to reduce/hide/tolerate com- 
munication time, as well as to study communication behavior for general 
purpose parallel computing [7]. Data partitioning used in High Performance 
Fortran is a typical method to reduce communication overhead. Analyzing 
the behavior of the program at compile time, data can be partitioned and 
allocated to processors such that runtime data movement can be minimized. 
While data distribution can be carefully designed to minimize the number
of remote reads in the course of computation, this approach is effective only
for specific applications where data partitioning can be well tuned.
Applications such as adaptive-mesh problems in computational science change
their behavior at runtime. The initial data distribution often becomes invalid
or inefficient after some computation.

Multithreading aims to tolerate memory latency through context switching.
Through a split-phase read transaction, a processor switches to another 
thread instead of waiting for the requested data to arrive, thereby masking 
the detrimental effect of latency [8,9]. The Heterogeneous Element Processor 
(HEP) designed by Burton Smith provides up to 128 threads [19]. A thread 
switch occurs in every instruction with 100 nsec switching cost. Threads are 
usually ended by remote read instructions since those may incur long latencies 
if the requested data is located in a remote processor [13]. The Monsoon data- 
flow machine developed at MIT switches context every instruction, where a 
thread consists of a single instruction [14]. 

The EM-4 multiprocessor provides hardware support for multithread- 
ing [16,17]. Thread switch takes place whenever a remote memory read is 
encountered. Threads can also be suspended with explicit thread scheduling. 
The Alewife multiprocessor provides hardware support for multithreading [2].
Together with prefetching, block multithreading with four hardware contexts
has been shown to be effective in tolerating the latency caused by cache
misses for shared-memory applications such as MP3D. The Tera multithreaded
architecture (MTA) provides hardware support for multithreading [4]. A
maximum of 128 threads is provided per processor. A context switch takes
place whenever a remote load or synchronizing load is encountered. The RWC-1
prototype minimizes the context switch overhead by prefetching [12].

An analytic model for multithreading is studied in [15]. The study in- 
dicated that the performance of multithreading can be classified into three 
regions: linear, transition, and saturation. The performance of multithread- 
ing is proportional to the number of threads in the linear region while it 
depends only on the remote reference rate and switch cost in the saturation 
region. The Threaded Abstract Machine studied by Culler et al. exploits par- 
allelism across multiple threads [6]. Fine-grain threads share registers to ex- 
ploit fine-grain parallelism using implicit switching. Experimental results on 
EM-4 indicated that simple-minded data distribution can give performance 
comparable to that of the best performing algorithms with hand-crafted data 
distribution but no threading [21]. 
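
The linear and saturation regions mentioned above can be made concrete with a small sketch. The formulas below are a simplified deterministic form that is commonly used to explain these regions. They are an assumption of this sketch and not necessarily the exact formulation of [15]. The parameter names (`run_length`, `switch_cost`, `latency`) are likewise chosen here for illustration. With fixed run lengths there is no transition region, which in a stochastic model appears as a smooth bend between the two regimes.

```python
def efficiency(n_threads: int, run_length: float,
               switch_cost: float, latency: float) -> float:
    """Fraction of processor time spent on useful work (deterministic sketch).

    Linear region: all threads finish their runs before the first remote
    read returns, so useful work grows with the number of threads.
    Saturation region: there is always a ready thread, so only the run
    length and the switch cost matter.
    """
    linear = n_threads * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(efficiency(n, run_length=10, switch_cost=2, latency=100), 3))
```

In the saturation region the latency no longer appears in the result, which matches the statement that performance there depends only on the remote reference rate and the switch cost.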

This chapter reports on the multithreading performance of the 80-proces- 
sor EM-X distributed-memory multiprocessor. The machine was built at the 
Electrotechnical Laboratory and has been fully operational since December 
1995. Several critical parameters which characterize the performance of mul- 
tithreading are investigated, including the number of threads, the run length, 
the number of remote reads, and the number of context switches. The inter- 
play between the parameters is explained with experimental results. Two 
widely used problems are selected for performance verification: bitonic sort- 
ing and Fast Fourier Transform. The two problems have been revisited and 
suitable algorithms are developed for the multithreaded machine environ- 
ment. Data and workload distribution strategies are developed to explicate 
their performance. The ultimate goal of multithreading is to tolerate com- 
munication time. In this respect, the experiments are carried out to identify 
how multithreading helps overlap communication with computation. 



2. Multithreading Principles and Its Realization 

2.1 The Principle 

A thread is a set of instructions which are executed in sequence. The mul- 
tithreaded execution model exploits parallelism across threads to improve 
the performance of multiprocessors [8,9]. Threads are usually delimited by 
remote read instructions which may incur long latency if the requested data 
is located in a remote processor. Through a split-phase read mechanism, a 
processor switches to another thread instead of waiting for the requested 
data to arrive, thereby masking the detrimental effect of latency. Figure 2.1 
illustrates the basic principle of multithreading. 

Processor 0, P0, has three threads, T0, T1, and T2, ready to execute in
the ready queue. The thread currently being executed is indicated by a thick
dark line. P0 starts executing the first thread, T0. As T0 is executed, a
remote read operation is reached, denoted by a dotted line. The processor
switches to T1 while the remote memory read request RR0 is pending. The
processor again switches to T2 when another remote memory read occurs in T1.
After T2 completes, T0 can resume its execution, assuming the requested data
has arrived.

Fig. 2.1. Multithreading on p processors, tcs = context switch time, trr =
remote read time, RR = remote read.

Several important parameters characterize the performance of
multithreading. They include (1) the number of threads per processor, (2)
the number of remote reads per thread, (3) the context switch mechanism, (4)
remote memory latency, (5) the remote memory servicing mechanism, and (6) the
number of instructions in a thread. While this is not an exhaustive list and
there are other important issues, we briefly explain the implications
of these parameters below.

The number of active threads indicates the amount of parallelism. To be 
more specific, the amount of parallelism can be classified into computation 
parallelism and communication parallelism. Computation parallelism refers 
to the ‘conventional’ parallelism while communication parallelism refers to 
the way threads can communicate with other threads residing in other pro- 
cessors. The figure shows that processor 0 has three active threads, indicating 
three thread computation parallelism, provided that they are independent of 
each other. Communication parallelism is not apparent from the figure as the 
way the three threads communicate is not specified in the figure. This will 
become clearer in the later sections. It is desirable to have a large number of 
threads since this will likely help tolerate latencies. However, the maximum 
number of threads that can be active (including the suspended status) at a 
point in time is bounded by the amount of memory available to the program.

The number of remote reads per thread determines the frequency of thread
switching and, in turn, the run length. Each remote read causes a thread
switch, so the number of switches is proportional to the number of remote
reads. It is therefore desirable that the remote reads be distributed evenly
over the life of a thread. This distribution leads to the issue of thread run
length. Thread run length is determined by the number of uninterrupted in- 
structions executed between two consecutive remote reads. The performance 
of multithreading is strongly affected by this parameter. If the run length is 
small, it will be difficult to tolerate the latency because there are not enough 
instructions to execute while the remote read is outstanding. Suppose that
the dark area of T2 in the figure is very short, consisting of, say, 10
instructions. The machine will not be able to tolerate the remote memory
latency since the run is too short for RR0 to return. A remote read will
typically take tens to hundreds of cycles, if not thousands. The RR0 shown in
Figure 2.1 will likely return after T2 runs to completion. In that event, the
machine will wait until RR0 returns, as there is no thread ready for execution.
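
The effect of run length on latency tolerance can be illustrated with a small simulation of one processor. This is a minimal sketch under simplifying assumptions that are not taken from the EM-X hardware: every thread has the same fixed run length, every remote read has the same latency `t_rr`, and every suspension costs `t_cs` cycles. The function name `simulate` and its parameters are hypothetical.

```python
import heapq
from collections import deque


def simulate(num_threads, segments, run_length, t_cs, t_rr):
    """Return (finish_time, idle_cycles) for one processor.

    Each thread executes `segments` runs of `run_length` cycles, and a
    remote read of latency `t_rr` separates consecutive runs.
    """
    ready = deque((tid, segments) for tid in range(num_threads))
    pending = []          # min-heap of (return_time, thread_id, runs_left)
    now = idle = 0
    while ready or pending:
        if not ready:
            # No thread is ready: the processor idles until a read returns.
            ret, tid, left = heapq.heappop(pending)
            idle += max(0, ret - now)
            now = max(now, ret)
            ready.append((tid, left))
            continue
        tid, left = ready.popleft()
        now += run_length                     # execute one uninterrupted run
        left -= 1
        if left > 0:
            heapq.heappush(pending, (now + t_rr, tid, left))  # issue remote read
            now += t_cs                       # switch away from the thread
        # Reads that have returned make their threads ready again.
        while pending and pending[0][0] <= now:
            _, rtid, rleft = heapq.heappop(pending)
            ready.append((rtid, rleft))
    return now, idle


if __name__ == "__main__":
    print(simulate(3, 2, run_length=10, t_cs=2, t_rr=100))  # short runs
    print(simulate(3, 2, run_length=60, t_cs=2, t_rr=100))  # long runs
```

With three threads and a latency of 100 cycles, runs of 10 cycles leave the processor idle for most of the read latency, while runs of 60 cycles keep it busy until all threads finish.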

Thread switch refers to how the control of a thread is transferred to an- 
other thread. Among the types of context switches are explicit switching and 
implicit switching. Implicit switching allows multiple threads to share regis- 
ters while explicit switching does not. Implicit switching is literally implicit 
in the sense that there is essentially no visible switching from the register 
point of view. This method, therefore, requires little or no switching over- 
head. However, the scheduling of registers and threads can be a challenging 
task. In explicit switching, threads do not share registers. A single thread
uses all the registers. Therefore, there is no issue as to how registers and
threads are scheduled. However, the main problem of explicit switching
is the cost associated with register saving and restoring. For each thread
switch, those registers currently being used by the thread need to be saved 
to preserve the status of the thread. When the thread that was suspended 
resumes, the registers for that resuming thread will have to be restored. This 
register saving and restoring can be a bottleneck for efficient multithreading. 
Several approaches have been taken to solve the problems associated with the 
two switching mechanisms. One approach would be to have a few explicit sets 
of registers, where a thread is assigned to a register set. Another approach 
is to prefetch the corresponding registers in such a way that the thread can 
immediately resume when a thread switch occurs. 

Communication latency is the main target of multithreading. It can vary 
depending on the technology used to build the machine and the interconnec- 
tion network. A desirable combination would be that the network bandwidth 
be comparable with the processor clock speed. Large disparity between the 
machine clock speed and the network bandwidth can be problematic when 
using multithreading. If the machine is fast but the network is slow, the
pressure to tolerate the latency is high, in which case multithreading is
unlikely to give noticeable results. On the other hand, if the machine is slow
but the network is fast, the effect of multithreading may not be visible either
since the latency is high in any case. There is a clear trade-off between the clock speed
and the network bandwidth.

The mechanism used to service remote memory operations is also an important
factor determining the performance of multithreading. The simplest approach
would be to have the main processor service remote read/write requests. In
this case, the possibility of overlapping computation with communication will
be very slim, giving little advantage to using multithreading. Some machines
such as IBM SP-2 and Intel Paragon employ a communication co-processor
to handle some communication-related activities, i.e., copying data out of
memory to the communication buffer or from buffer to memory. The main
advantage of using a communication co-processor is to take some burden off
the main processor, thereby leaving the main processor for computation for
most of the time. Multithreading can have a significant boost if equipped
with such a remote servicing mechanism. The EM-X multithreaded processor
employs a remote by-passing mechanism to perform direct remote read/write
operations, or remote direct memory access (RDMA). The main processor is
not aware of such remote read/write activities, as will be explained shortly.

The number of instructions in a thread is often referred to as thread granu- 
larity. While there is no clear agreement on such a classification, thread gran- 
ularity can be classified into three categories: fine-grain, medium-grain, and 
coarse-grain. Fine-grain threading typically refers to a thread of a few to tens 
of instructions. It is essentially for instruction-level multithreading. Medium- 
grain threading can be viewed as a loop-level or function-level threading, 
where a thread consists of hundreds of instructions. Coarse-grain threading 
may be viewed as more of a task-level threading, where each thread consists
of thousands of instructions. However, coarse-grain threading should not be
interpreted as operating system level multitasking, where different programs
interleave to mask off page faults. How the threads are formed is another
important question which needs to be answered. Threads can be automatically
generated by compilers or explicitly specified by the programmers. 

2.2 The EM-X Multithreaded Distributed-Memory 
Multiprocessor 

The EM-X multiprocessor is a large-scale multithreaded distributed-memory 
multiprocessor consisting of 80 custom-built processors, called EMC-Y. The 
machine was built at the Electrotechnical Laboratory and has been
operational since December 1995 [10, 11]. The main objective of building the
machine is to investigate the performance of fine-grain instruction-level
multithreading. Two types of computational principles have been employed in
designing the multiprocessor. The first level uses the data-flow principles of
execution to realize dynamic function spawning or runtime thread invocation.
Any processor can dynamically spawn function calls (or threads) on any
other processor(s), including itself. This dynamic function spawning enables
instruction-level multithreading and efficient fine-grain communication. The
second level employs the conventional RISC-style execution to exploit program
locality. The machine has a two-stage pipeline consisting of fetch and
execute. Instructions that do not require remote reads/writes or dynamic
function spawning are executed by this two-stage pipeline.

The current EM-X prototype has 80 EMC-Y processors connected through
a circular Omega network. Figure 2.2 shows the prototype EM-X multiprocessor.
The network is a variation of the Omega network which repeats a pair of
shuffle and exchange steps. While the Omega network is a multistage network,
the EM-X network is not. The main difference between the Omega network and
the EM-X network is that each processor is attached to a switch box. The EM-X
network provides a maximum diameter of O(log P) for low latency and high
throughput, where P is the number of processors. All communication
activities are one-sided, i.e., only one processor is involved in communication.
The destination processor has no knowledge of any communication initiated
by the source processor. Therefore, there is no notion of processor locking or
of the sending and receiving used in the message-passing paradigm.




Fig. 2.2. The 80-processor EM-X distributed-memory multiprocessor. 



All communication in the EM-X is done with 2-word fixed-sized
packets. Communication can be classified into three categories: remote read,
remote write, and thread invocation. A remote read packet consists of two
32-bit words. The first word contains the destination address, from which data
will be read at the destination processor. The second word is the return
address, also called the continuation, which will be explained in detail
shortly. A remote write packet also consists of two words. The first word is
the destination address while the second is the data to be written on arrival
at the destination processor. Thread invocation is done through packets, which
will be explained after we present some details of the processor architecture.

Figure 2.3 shows the organization of the EMC-Y processor. A processor 
element is a single chip pipelined RISC-style processor, designed for fine-grain 
parallel computing. Each processor runs at 20 MHz with 4 MB of one-level 
static memory. The EMC-Y pipeline is designed to combine the register-based 
RISC execution principle with the packet-based dataflow execution principle 
for synchronization and message handling support. Each processor consists 
of Switching Unit (SU), Input Buffer Unit (IBU), Matching Unit (MU), Ex- 
ecution Unit (EXU), Output Buffer Unit (OBU) and Memory Control Unit 
(MCU). 




Fig. 2.3. The architecture of the EMC-Y processor. 



The Switch Unit sends/receives packets to/from the network. It consists 
of three types of components: two input ports, two output ports and a three- 
by-three cross-bar switch. Each port can transfer a packet, which consists of 
a word of address part and a word of data part, at every second cycle. A
packet can be transferred in k + 1 cycles to the processor k hops away by
virtual cut-through routing. The message non-overtaking rule is enforced by
this switch unit. 

The Input Buffer Unit receives packets from the switch unit. It has two
levels of priority packet buffers for flexible thread scheduling. Each buffer
is an on-chip FIFO, which can hold up to 8 packets. If the buffer becomes
full, the packets are stored in an in-memory buffer, and when the on-chip
buffer is no longer full, they are automatically restored to the on-chip FIFO.
The IBU operates independently of the EXU and the memory unit. Packets coming
in from the network are immediately processed without interrupting the main
processor. The path between IBU and MCU, called by-passing direct memory
access (DMA), is one of the key features of EM-X. This by-passing DMA, together
with the path which connects IBU to OBU, is the key to servicing remote
read/write requests without consuming the cycles of the Execution Unit.

The Matching Unit (MU) fetches the first packet in the FIFO of IBU. If 
the packet requires matching, a sequence of actions will take place to prepare 
for thread invocation by direct matching. Actions include (1) obtaining the 
base address of the activation frame for the current thread to be invoked, 
(2) loading mate data from matching memory, (3) fetching the top address of 
the template segment, (4) fetching the first instruction of the enabled thread, 
and (5) signaling the execution unit to start execution of the first instruction. 
Packets are sent out through the OBU which separates the EXU from the 
network. The MCU controls the access to the local memory off the EMC-Y 
chip. 

The Execution Unit is a register-based RISC pipeline which executes a 
thread of sequential instructions. It has 32 registers, including five special 
purpose registers. All integer instructions take one clock cycle, with the ex- 
ception of an instruction which exchanges the content of a register with the 
content of memory. Single precision floating point instructions are also exe- 
cuted in one clock cycle, except floating point division. Packet generation is 
also performed by this unit, which takes one clock cycle. Four types of send
instructions are implemented, including a remote read request for a single
data item and a remote read request for a block of data.

The Output Buffer Unit receives packets generated by the EXU or IBU.
Again, the buffer can hold up to 8 packets. As we have briefly described
above, the key feature of the OBU is to process packets generated by the IBU.
Remote read requests received from other processors are processed by the IBU,
which uses the by-pass DMA to read data from the memory. When the data
fetched by the IBU is given to the OBU, it is immediately sent out to the
destination address specified in the read request packet. This internal working
of IBU and OBU is the key feature of EM-X for fast remote reads/writes
without consuming the main processor cycles.

2.3 Architectural Support for Fine-Grain Multithreading 

The EM-X distributed-memory multiprocessor supports dynamic function
spawning and multithreading both in hardware and software. Hardware support
includes thread invocation through packets, FIFO hardware scheduling
of threads, and by-passing one-sided remote read/write. Software support
for multithreading includes explicit context switch, global address space, and
register saving. Thread invocation or function spawning is done through
2-word-sized packets. When a thread needs to invoke a function (thread), a
packet containing the starting address of the thread is generated and sent to
the destination processor. The thread which just issued the packet continues
the computation without any interruption unless it encounters a remote read
or explicit thread switching.

As the thread invocation packet arrives at the destination processor, it 
will be buffered in the packet queue along with other packets that have
arrived. Packets stored in the packet queue are read in the order in which
they were received,
hence First-In-First-Out (FIFO) thread scheduling. A thread of instructions 
is in turn invoked by using the address portion of the packet just dequeued. 
The thread will run to completion unless it encounters any remote memory 
operations or explicit thread switching. If the thread encounters a remote 
memory operation, it will be suspended after the remote read request is 
sent out. Should this suspension occur, any register values currently being 
used for the thread will be saved in the activation frame associated with the 
thread for resumption upon the return of the outstanding remote memory 
operation. The completion or suspension of a thread causes the next packet 
to be automatically dequeued from the packet queue using FIFO scheduling. 
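
The FIFO scheduling described above can be sketched in software. This is only an illustrative model and not the EM-X hardware mechanism: Python generators stand in for threads whose state is saved in an activation frame, a `deque` stands in for the packet queue, and the reply to a remote read is appended to the queue immediately, so network latency is ignored. The names `worker`, `step`, and `run_processor` are hypothetical.

```python
from collections import deque


def worker(name, addresses, trace):
    """A thread that sums remote values; each `yield` is a split-phase read."""
    total = 0
    for addr in addresses:
        trace.append(f"{name} issues read {addr}")
        value = yield addr          # suspend until the reply packet arrives
        total += value
    trace.append(f"{name} done total={total}")


def step(thread, value):
    """Resume a thread; return the address it reads next, or None when done."""
    try:
        return thread.send(value)
    except StopIteration:
        return None


def run_processor(threads, remote_memory):
    """Dequeue packets in FIFO order and run each thread until it suspends."""
    queue = deque((t, None) for t in threads)    # thread invocation packets
    while queue:
        thread, data = queue.popleft()           # FIFO dequeue
        addr = step(thread, data)
        if addr is not None:
            # Reply packet joins the tail of the queue (latency ignored).
            queue.append((thread, remote_memory[addr]))


if __name__ == "__main__":
    trace = []
    memory = {0: 10, 1: 20, 2: 30}
    run_processor([worker("A", [0, 1], trace), worker("B", [2], trace)], memory)
    print("\n".join(trace))
```

The trace shows thread B starting as soon as thread A suspends on its first read, and each thread resuming in the order in which its reply was queued.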

Whenever a thread encounters a remote read, a packet consisting of two 
32-bit words is generated. The first 32-bit word contains the destination
address whereas the second 32-bit word contains the return address, which is
often called the continuation. The read packet will be appropriately routed to the
destination processor, where it will be stored in the input buffer unit for 
processing. The remote processor does not intervene to process the packet. 
The remote read packet will be processed through the by-passing mechanism 
which was explained earlier. When the read packet returns to the originat- 
ing processor, it will be inserted in the hardware FIFO queue for processing, 
i.e., thread resumption. Remote writes do not suspend the issuing threads. 
For each remote write, a packet is generated which consists of two 32-bit 
words. The first word is the destination memory address and the second the 
data to be written. The write instruction is treated the same as other normal 
instructions. After sending out the write packet, the thread continues. 
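
The two-word packets described above can be sketched as follows. Several details are assumptions made only for illustration and are not specified in the text: the big-endian byte order, the reply layout (continuation in the first word, data in the second), and the way a packet's kind is passed alongside it. The names `make_packet` and `service_remote` are hypothetical.

```python
import struct

TWO_WORDS = ">II"   # two 32-bit words; the byte order is an assumption


def make_packet(first_word: int, second_word: int) -> bytes:
    """Pack a fixed-size packet of two 32-bit words."""
    return struct.pack(TWO_WORDS, first_word, second_word)


def service_remote(kind: str, packet: bytes, memory: dict):
    """Service a packet at the destination without running any thread there.

    read:  first word = address to read, second word = continuation.
           Returns a reply packet (continuation, data); layout assumed.
    write: first word = address to write, second word = data.
           Returns None, since the writer does not wait for a reply.
    """
    first, second = struct.unpack(TWO_WORDS, packet)
    if kind == "read":
        return make_packet(second, memory[first])
    if kind == "write":
        memory[first] = second
        return None
    raise ValueError(f"unknown packet kind: {kind}")


if __name__ == "__main__":
    memory = {0x100: 7}
    reply = service_remote("read", make_packet(0x100, 0xBEEF), memory)
    print(struct.unpack(TWO_WORDS, reply))
    service_remote("write", make_packet(0x104, 99), memory)
    print(memory[0x104])
```

The read request carries its own return address, so the destination needs no knowledge of the requesting thread. This is what allows the request to be serviced by the by-passing path alone.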

Software support for multithreading includes explicit context switch,
global address space, and register saving. The current compiler supports C
with a thread library. Programs written in C with the thread library are
compiled into explicit-switch threads. Two storage resources are used in EM-X:
template segments and operand segments. The compiled code of functions is
stored in template segments. Invoking a function involves allocating an
operand segment as an activation frame. The caller allocates the
activation frame, deposits the argument value(s) into the frame, and sends
its continuation as a packet to invoke the callee’s thread. The first
instruction of a thread operates on input tokens, which are loaded into two
operand registers. The registers can hold values for one thread at a time. The
current implementation allows no register sharing across threads, thus no
implicit-switching support. The caller saves any live registers to the current
activation frame before a context switch. The continuation packet sent from the
caller is used to return results as in a conventional call. The result from the
called function resumes the caller’s thread by this continuation.

The level of thread activation and suspension can be nested and arbitrary. 
Activation frames (threads) form a tree rather than a stack, reflecting a 
dynamic calling structure. This tree of activation frames allows threads to
spawn one or many threads on any processor, including their own. The level of thread
activation/suspension is limited only by the amount of system memory. The 
EM-X compiler supports a global address space. Remote reads/writes are 
implemented through packets. A remote memory access packet uses a global 
address which consists of the processor number and the local memory address 
of the selected processor. 

3. Designing Multithreaded Algorithms 

3.1 Multithreaded Bitonic Sorting 

Bitonic sorting, introduced by Batcher [5], consists of two steps: local sort
and merge. Given P processors and n elements, each processor holds n/P
elements. In the local sort step, each processor takes in n/P elements and
sorts them in ascending order if the second bit of the processor number is
0, and in descending order otherwise. The merge step consists of O(log^2 P)
steps. In each merge step, elements are sorted across a pair of processors. As
iterations progress, the distance between the pair of processors widens. The
last iteration will sort elements on two processors with the distance of P/2.
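
A sequential sketch of this block-wise bitonic sort follows. It simulates the P processors as a list of lists and is not the chapter's implementation. It makes two simplifications that are assumptions of this sketch: every local sort is ascending, and each pairwise merge is done by sorting the concatenation of the two lists, after which each processor keeps the low or high half. P is assumed to be a power of two. The function name `bitonic_sort_blocks` is hypothetical.

```python
def bitonic_sort_blocks(blocks):
    """Sort equally sized blocks held by P simulated processors.

    Processor p ends up with the p-th smallest group of elements.
    """
    P = len(blocks)
    assert P > 0 and P & (P - 1) == 0, "P must be a power of two"
    blocks = [sorted(b) for b in blocks]          # local sort step
    stages = P.bit_length() - 1                   # log2(P)
    for i in range(stages):                       # outer loop
        for j in range(i, -1, -1):                # inner loop: mate at distance 2**j
            new_blocks = []
            for p in range(P):
                mate = p ^ (1 << j)
                merged = sorted(blocks[p] + blocks[mate])
                size = len(blocks[p])
                # Groups of 2**(i+1) processors alternate sort direction.
                ascending = ((p >> (i + 1)) & 1) == 0
                keep_low = (p < mate) == ascending
                new_blocks.append(merged[:size] if keep_low else merged[size:])
            blocks = new_blocks
    return blocks


if __name__ == "__main__":
    data = [[5, 13, 24, 32], [6, 14, 23, 31]]
    print(bitonic_sort_blocks(data))  # [[5, 6, 13, 14], [23, 24, 31, 32]]
```

The total number of pairwise merge steps is log P (log P + 1) / 2, which is where the O(log^2 P) bound comes from.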

Figure 3.1 illustrates bitonic sorting of n=32 elements on P=8 processors.
Consider processors 0 and 1 at i=0, j=0. P0 has L=(5,13,24,32) and P1 has
L=(6,14,23,31), resulting from the local sorting step. P0 and P1 will sort 8
elements in ascending order, as indicated by shaded circles. Hollow circles
indicate that processors sort elements in descending order. The line between
P0 and P1 indicates that the processors communicate. P0 sends L to its
mate processor P1 while P1 sends L to its mate P0. When P0 receives four
elements from P1, it merges them with L, and so does P1. Since P0 takes a lower
position than P1, it takes the low half (5,6,13,14) while P1 takes the high
half (23,24,31,32). This type of sending, receiving, and merging operations
continues until the 32 elements are sorted across the eight processors.

A multithreaded version of bitonic sorting divides the inner j loop into h 
threads. Each thread is responsible for merging n/hp elements. The main idea 
of the multithreaded algorithm is to first issue remote reads from all h 
threads, called thread communication parallelism, and then to compute 
whenever any n/hp elements have been read, called thread computation 
parallelism. Reads for all n/p elements are issued before any merging takes 
place. Whenever a thread finishes reading its n/hp elements from the mate 
processor, it merges them into its list L. Reading (communication) and 
merging (computation) thus take place simultaneously, overlapping 
computation with communication. 

Fig. 3.1. Bitonic sorting of n=32 elements on P=8 processors. Shaded 
circles indicate those processors performing ascending order merge while the 
hollow circles indicate processors performing descending order merge. Curves 
connecting two processors indicate that each processor reads four elements 
from the mate processor. 



Figure 3.2 illustrates how two processors Px and Py sort 8 elements in 
ascending order. For illustration we use two threads in each processor. 
The four elements are divided into two parts, each of which is assigned 
to a thread. Px and Py initially hold (2,5,6,7) and (1,3,4,8), 
respectively. Thread 0 of Px is responsible for reading and merging the first 
half (1,3) of Py while thread 1 handles the second half (4,8). Sorting of the 
eight elements on the two processors proceeds as follows: 

1. At ta, Thd0 sends out the read request RR0 to Py, and suspends itself. 

2. Between ta and tb, the switch to Thd1 takes place, spending several 
clocks. 

3. At tb, Thd1 sends out the read request RR2 to Py, and in turn is 
suspended. 

4. Between tb and tc, there are no threads running. Both threads are 
dormant. 

5. At tc, RR0 returns with the value 1, which is saved in a buffer for the 
merge. The returned value resumes Thd0. 

6. At td, RR2 returns with the value 4, but no further activity takes 
place since Thd0 is currently running. 

7. At te, Thd0 sends out the read request RR1 to Py, and then suspends 
itself. Switching to Thd1 takes place. 

8. At tf, Thd1 sends out the read request RR3 to Py, and in turn is 
suspended. 

9. Between tf and tg, there are no running threads. Both threads are 
suspended, and therefore no computation takes place. Even though 
Thd1 has received the value 4, it cannot perform the merge operation 
since Thd0 is not complete. Merging 4 with the list would result in a wrong 
order. Thd1 can proceed only after Thd0 completes. This is exactly where 
sorting lacks computation parallelism across threads. As we shall see 
shortly, FFT has large computation parallelism across all threads. 

10. At tg, RR1 returns with the value 3. Thd1 is still suspended. 
Thd0 has now read all the necessary elements, and immediately proceeds 
to merge the two elements with its own list. 

11. At th, RR3 returns with the value 8, but no action takes place since Thd0 
is currently running. 

12. At ti, Thd0 completes the merge, resulting in the output (1,2,3). 
Switching to Thd1 now takes place. Since Thd1 also has its two elements 
read from Py, it immediately proceeds to merging, which gives 
(1,2,3,4). 

Fig. 3.2. A multithreaded version of bitonic sorting. The figure is not drawn 
to scale. Processors x and y sort 8 elements in ascending order. Characters 
a..i indicate the time sequence. Each processor has two threads. A thread 
handles 2 elements. Communication for Px is in solid lines and for Py is in 
dotted lines. 

The above example assumes that each thread merges only after it has read 
n/hp consecutive elements from the mate processor. As is clear from the 
example, bitonic sorting presents little computation parallelism across 
threads. Although communication can be done in parallel, computation must 
proceed in an orderly fashion so that the output buffer contains elements 
sorted in the proper order. It should also be noted that the amount of 
computation is not the same for each thread. Thread 0 performed merge 
operations with 1 and 3. Thread 1, however, performed a merge operation with 
only one value, 4. When Thread 1 merged the value 4, the four output elements 
were sorted. Thread 1 is therefore not required to read the fourth element, 
8, from the mate processor. This irregularity in terms of computation occurs 
because not all the elements residing in the mate processor need to be read. 
The algorithmic behavior of bitonic sorting is beyond the scope of this 
report and is presented in [20]. 
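
The ordering constraint and the uneven work per thread can be reproduced with a small sequential model. The sketch below is an assumption-laden illustration, not the EM-X implementation: it ignores timing and context switches, models only the ascending low-half merge on one processor, and uses the hypothetical name ordered_thread_merge.

```c
#include <stddef.h>

/* Ascending low-half merge on one processor, done chunk by chunk in thread
   order. own[] and mate[] each hold h*m ascending elements; thread t is
   responsible for mate[t*m .. t*m+m-1]. out[] receives the h*m smallest
   elements, and used[t] the number of mate elements thread t merged. */
void ordered_thread_merge(const int *own, const int *mate,
                          size_t h, size_t m, int *out, size_t *used)
{
    size_t total = h * m;     /* elements this processor must produce */
    size_t produced = 0;      /* elements written to out[] so far */
    size_t oi = 0;            /* next unread element of own[] */

    for (size_t t = 0; t < h; t++) {   /* thread t runs only after t-1 */
        const int *chunk = mate + t * m;
        size_t ci = 0;
        /* Emit elements until this thread's chunk is exhausted or the
           output is complete; own elements beyond that point must wait
           because a later chunk may still hold smaller values. */
        while (produced < total && ci < m) {
            if (own[oi] <= chunk[ci])
                out[produced++] = own[oi++];
            else
                out[produced++] = chunk[ci++];
        }
        used[t] = ci;
    }
}
```

For the example above, own = (2,5,6,7), mate = (1,3,4,8), h = 2 and m = 2 give the output (1,2,3,4), with thread 0 merging two mate elements (1 and 3) and thread 1 merging only one (4).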

3.2 Multithreaded Fast Fourier Transform 

The second problem used in this study is the Fast Fourier Transform (FFT). 
Figure 3.3 shows an implementation of FFT with 16 elements on four processors. 
Blocked data and workload distribution methods are used in the example. 
The 16 data elements are divided into four groups, each of which is assigned 
to a processor. Processor 0, or P0, has elements 0 to 3, P1 has 4 to 7, etc. FFT 
with n elements requires log n iterations. The butterfly shown in the figure 
requires log 16 = 4 iterations. In the first iteration, each processor obtains a 
copy of four elements from its mate processor. P0 remote-reads the four 
elements 8...11 from P2, while P1 reads 12...15 from P3. P2 and P3 likewise 
obtain the necessary data allocated to P0 and P1, respectively. 

The second iteration is essentially the same as the first, except that 
the logical communication distance is half that of the first iteration. 
P0 remote-reads from P1 the four elements newly computed by P1 in 
iteration 0, P1 reads the four newly computed elements 0...3 from 
P0, etc. P2 and P3 again perform operations similar to those of P0 and P1. 
The remaining two iterations do not require communication since the 
required data are stored locally. In general, an FFT with blocked data 
distribution of n elements on P processors requires communication for the 
first log P iterations. The remaining (log n) - (log P) iterations are local 
computations, which need no communication. In this report, only the first 
log P iterations are used for our experiments since they are the ones that 
require communication. 
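
The iteration counts and the pairing in the example can be sketched as follows. The XOR-based mate computation is the usual butterfly pairing and is an assumption here, since the text only states which processors exchange data in the 16-element, four-processor example. The function names are hypothetical, and n and P are assumed to be powers of two.

```c
/* log2 of x, where x is assumed to be a power of two. */
int ilog2(unsigned x)
{
    int bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Iterations that need remote reads under blocked distribution: log P. */
int fft_comm_iterations(unsigned P)
{
    return ilog2(P);
}

/* Remaining purely local iterations: (log n) - (log P). */
int fft_local_iterations(unsigned n, unsigned P)
{
    return ilog2(n) - ilog2(P);
}

/* Mate processor in communicating iteration it (0-based). The distance
   starts at P/2 and halves each iteration (assumed butterfly pairing). */
unsigned fft_mate(unsigned pid, int it, unsigned P)
{
    return pid ^ (P >> (it + 1));
}
```

For n=16 and P=4 this gives two communicating and two local iterations, with P0 paired with P2 in iteration 0 and with P1 in iteration 1, matching the description above.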

Fig. 3.3. FFT with 16 elements and four processors. Each processor is 
assigned four elements using blocked data distribution. The first two 
iterations require communication while the remaining are local computation. 



Converting the single-threaded FFT to a multithreaded version is straight- 
forward. Again, the same blocked data and workload distribution are used 
in the example. Like bitonic sorting, the data assigned to each processor is 
grouped into h threads to control the thread granularity. The example in 
Figure 3.4 shows the internal working of processors 0 and 2 of Figure 3.3. 
Four elements are split into two groups, where each processor has two threads 
each of which handles two elements. 

Unlike bitonic sorting, however, FFT possesses no data dependence between 
elements within an iteration. Computation can therefore begin whenever any 
data is remote-read from the mate processor. In the above example, the 
threads compute and communicate independently of other threads. 
When Thd0 issues the remote read RR0, it is suspended. Processor 0 now 
switches to Thd1, which subsequently issues the remote read RR2. As RR0 
returns the value 8, Thd0 proceeds to computation while RR2 is outstanding. 
As Thd0 completes the computation with the value 8, it sends out RR1 and 
suspends itself. Thd1 immediately proceeds to computation with the value 10, 
which returned some time ago. Since the value 10 is the only one returned, 
the FIFO thread scheduling allows Thd1 to proceed immediately to computation 
with it. Unlike bitonic sorting, no threads are synchronized for an orderly 
computation in FFT. No time is, therefore, lost for thread scheduling. This 
computation parallelism across threads will be evidenced by the experimental 
results shown later. 

Fig. 3.4. A multithreaded FFT, showing iteration 0. The figure is not drawn 
to scale. Each processor has two threads, each of which computes two points. 
No thread synchronization is required for FFT. 



4. Overlapping Analysis 

The multithreaded versions of fine-grain bitonic sorting and FFT have been 
implemented on the EM-X. They are written in C with a thread library. 
To measure the effectiveness of the overlapping capability, we forced loops 
to execute synchronously by inserting a barrier at the end of each iteration. 
The terms elements and integers are used interchangeably in this chapter. 
The unit for sorting is integers while that for FFT is points. An integer is 
32 bits. A point consists of real and imaginary parts, each of which is 
32 bits. We list below several parameters used throughout this chapter: 

— P = the number of processors, up to 64. 

— n = the number of data elements, up to 8 M, the maximum size which 
EM-X can accommodate. 

— h = the number of threads per processor 

— m = n/hp = the number of data elements per thread. 

Communication times are plotted in Figure 4.1. The x-axis shows the 
number of threads while the y-axis shows the absolute communication time. 
The most important observation is that the communication time becomes 
minimal when the number of threads is three to four. The reason is clear. 
In bitonic sorting, each thread reads m elements from the mate processor 
before proceeding to the merge operation. The following loop is taken from 
the actual program. 




Fig. 4.1. Communication time in seconds. 



for (k = 0; k < m; k++)   /* m = n/hp = # of elements per thread */ 
    buffer[k] = remote_read(memaddr++); 

In each iteration, an element is read from the mate processor, assuming 
memaddr is properly initialized. After each read request, the thread is 
suspended and another thread is reactivated, until each thread has read m 
elements. The loop body has 12 instructions, i.e., an iteration takes 12 
clocks to execute, giving a run length of 12. The average remote memory 
latency, when the network is normally loaded, is approximately 1 to 2 
microseconds, or 20-40 clocks. Therefore, each remote read needs two to four 
threads to mask off the 20-40 clock latency. This is precisely why the 
communication time becomes minimal when the number of threads is two to 
four. More than four threads give no notable advantage in masking off the 
latency. 
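
The estimate above is a ceiling division of the latency by the run length. The helper below is a minimal sketch of that arithmetic; the name threads_to_mask is hypothetical, and the model assumes every thread has the same run length and ignores switch overhead.

```c
/* Number of run lengths needed to cover a remote-read latency:
   ceil(latency_clocks / run_length_clocks). Both arguments are in clocks
   and are assumed to be positive. */
int threads_to_mask(int latency_clocks, int run_length_clocks)
{
    return (latency_clocks + run_length_clocks - 1) / run_length_clocks;
}
```

With a run length of 12 clocks, a 20-clock latency needs two threads and a 40-clock latency needs four, which is the two-to-four range quoted in the text.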

The effect of multithreading is higher for FFT, as evidenced by the deep 
valleys. The run length for FFT is much higher than sorting. As we have 
explained earlier, bitonic sorting requires thread synchronization to ensure 
proper merge of elements. However, FFT is free from thread synchronization. 
Therefore, the run length for FFT is very large. The following code shows 
how multithreading is actually implemented. 

for (i = 0; i < m; i++) {   /* m = n/hp = the number of points per thread */ 
    /* compute real_address and img_address */ 

    /* read two floats from the mate processor */ 
    mate_real = remote_read(real_address++); 
    mate_img  = remote_read(img_address++); 

    /* a lot of instructions with my_real, my_img, mate_real, mate_img */ 
} 

Unlike sorting, FFT proceeds to computation on the elements read from 
the mate processor. There is a very large number of instructions immediately 
after the second remote read. This large amount of computation can 
effectively mask off the latency. This is precisely why two or three threads 
outperform all other thread counts in FFT. 

When the two problems are cross-compared, we note that sorting has a 
much higher communication time than FFT. There are several reasons for the 
high communication time. The first is the number of switches. Sorting 
requires thread synchronization whereas FFT does not. This thread 
synchronization presents a severe bottleneck as it limits the amount of 
computation parallelism across threads. Second, sorting presents irregular 
computation and communication behavior because not all the elements 
of the mate processor are needed to complete the merge operation. FFT, on 
the other hand, requires all the elements to be read for computation. 

When the two problems are compared across different numbers of 
processors, the communication pattern is relatively consistent for both 
sorting and FFT. For bitonic sorting, increasing the number of processors to 
64 rarely changes the communication pattern. As we can see, there is little 
difference between Figure 4.1(a) and (b) for sorting, or between (c) and (d) 
for FFT. This consistency indicates that the number of processors is not the 
main factor contributing to the communication pattern. It should be 
noted that the data size for each processor is the same regardless of the 
total number of processors. 

The effects of data size on the communication pattern are not consistent 
across the two problems. For bitonic sorting with P=64, we find that varying 
the data size rarely affects the communication performance, except for one 
thread. For FFT, however, the effect becomes apparent. Note from 
Figure 4.1(d) with P=64 that the small data size of 512K gives a steeper 
curve than that of 8M, except for one thread. In other words, the curve for 
512K has a deeper valley than the one for 8M. The reason is that the data 
size of 8K for each processor is just too small compared to 128K. This 
relatively small data size is not significant enough to provide computations 
which can help mask off the communication latency. 

To put the communication times into the multithreading perspective, we 
identify the efficiency of overlapping. Let $T_{comm,h}$ be the communication 
time for h threads. We define the efficiency of overlapping as 
$E = (T_{comm,1} - T_{comm,h}) / T_{comm,1}$. The communication time with one 
thread is used as the basis for overlapping analysis. When only one thread 
is used, there is no possibility that computation will overlap with 
communication since there is no other thread to switch to. Figure 4.2 shows 
the EM-X overlapping capabilities for the two problems. 
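
As a small worked example, the efficiency definition can be written as a one-line C function. The function name and the timing values used below are hypothetical and chosen only to illustrate the formula; they are not measurements from the EM-X.

```c
/* Efficiency of overlapping: E = (T_comm,1 - T_comm,h) / T_comm,1, where
   t_comm_1 is the communication time with one thread and t_comm_h the
   communication time with h threads (same time unit, t_comm_1 > 0). */
double overlap_efficiency(double t_comm_1, double t_comm_h)
{
    return (t_comm_1 - t_comm_h) / t_comm_1;
}
```

For instance, if a hypothetical run took 10.0 s of communication time with one thread and 6.5 s with h threads, the efficiency would be about 0.35. An unchanged communication time gives an efficiency of exactly 0.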




Fig. 4.2. Efficiency of overlapping 



Bitonic sorting has given roughly 35% overlapping of communication with 
computation. FFT, however, has given over 95% overlapping for two to 
four threads. This rather significant difference is attributable to two 
factors. First, bitonic sorting is sequential, presenting little parallelism 
among threads within an iteration, while FFT is highly parallel. As explained 
in Section 3.1, communication for sorting can take place in any order, but 
computation must be done in ascending thread order to ensure a proper merge. 
Thread j cannot proceed to computation before Thread i, where j > i. 
Synchronization between threads is required to sort numbers properly. 
Therefore, bitonic sorting provides parallelism in remote reading only, not 
in computation. Threads in FFT, on the other hand, can proceed in any order, 
i.e., computation and communication can proceed in any order. Since there is 
no dependence between elements within an iteration, thread synchronization is 
not necessary, resulting in high parallelism among threads. This parallelism 
is clearly revealed in Figure 4.2(c)-(d). 

The second reason FFT shows high overlapping efficiency is that its 
amount of computation is much higher than its amount of communication. 
The total amount of computation for sorting is very small, consisting 
of several comparison and merging instructions. The computation for each 
element is no more than 10 instructions, excluding loop control instructions. 
The computation involved in each element of FFT, on the other hand, is 
large, including some trigonometric function computations and a loop 
to find complex roots. There is a rather large difference between the two 
programs in terms of the computation associated with each element. 



5. Analysis of Switches 

Context switches are one of the key parameters that determine the 
performance of multithreading. In this section, we look further into the 
behavior of multithreading in terms of switches. Figure 5.1 shows the 
individual execution times of the two problems. The plots have four timing 
components: computation, overhead, communication, and switching, listed 
from the bottom. There is no apparent anomaly in the distribution of times, 
except for one thread. The relative execution time for one thread differs 
from the others because one thread involves no overlapping, which makes the 
relative communication time 'look' larger. This relatively large 
communication time in turn makes the computation time look smaller. 

Computation times for bitonic sorting are less than communication times. 
Figures 5.1(a) and (c) show that computation times change as the number of 
threads changes. In fact, the total amount of computation should not change. 
The small change is attributable to the fact that the timing measurement is 
done through a global clock. When the problem size is large, no fluctuation 
occurs since the time to read the global clock is negligible compared to the 
overall computation time, as is evidenced by (b) and (d). The somewhat larger 
change in computation time for bitonic sorting than for FFT is attributable 
to another factor. Sorting is implemented in such a way that a processor may 
or may not have to read all the elements from the mate processor. As soon as 
a processor has produced n/p elements, it is done with the computation and 
goes into synchronization. 

Overhead refers to the time taken to generate packets. It is essentially 
fixed not only for different numbers of processors but also for different 
problems since the total number of elements allocated to each processor is 
the same. We measured the overhead by using a null loop body, i.e., a loop 
body that has no computation, only the instructions to generate packets. We 
found this an effective way to measure the overhead cost of generating 
packets. 

Fig. 5.1. Distribution of execution time on 64 processors: The times from the 
bottom are computation time, overhead, communication time, and switching 
time. The x-axis is the number of threads. 

Switches are classified into three types: remote read switch, iteration 
synchronization switch, and thread synchronization switch. Figure 5.2 shows 
the three types of switches. The x-axis indicates the number of threads and 
the y-axis shows the absolute number of switches. The plots are drawn to the 
same scale. The figure reveals the internal working of multithreading. 
Remote read switching is in general the dominant component of the total 
switching cost, because every remote read causes a thread switch. The remote 
read switching cost is fixed regardless of the number of threads because the 
number of elements to be read is fixed. In fact, the number of remote read 
switches can be readily derived from the given n, h, and P. 
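
One way to carry out this derivation is sketched below. It assumes one switch per remote read and that every thread reads all m = n/hp of its elements in each communicating iteration, so for sorting, where a merge may finish early, the result is an upper bound. The function names and the reads_per_element parameter are hypothetical.

```c
/* Elements per thread: m = n / (h * P). */
unsigned long elements_per_thread(unsigned long n, unsigned long h,
                                  unsigned long P)
{
    return n / (h * P);
}

/* Remote-read switches per processor in one communicating iteration,
   assuming one switch per remote read. reads_per_element is 1 for an
   integer and 2 for an FFT point (real and imaginary parts). */
unsigned long remote_read_switches(unsigned long n, unsigned long h,
                                   unsigned long P,
                                   unsigned long reads_per_element)
{
    return h * elements_per_thread(n, h, P) * reads_per_element;
}
```

Under these assumptions the count is n/P times the reads per element, independent of h whenever h divides n/P: 512K integers on 64 processors give 8K switches per processor for 4 or 16 threads alike, and 8M integers give 128K.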

It is clear that thread synchronization switching cost is not the main factor 
for the two problems regardless of the number of processors. The behavior of 
thread synchronization switching is different for the two problems. The 
thread switching cost for bitonic sorting is rather high and is close to the 
iteration synchronization switching cost. FFT, on the other hand, shows 
a wide gap between thread and iteration synchronization switching costs. 
This gap shows that sorting spends a lot more time synchronizing threads 
within a processor. This was expected because threads in sorting are executed 
in sequence while FFT threads can execute in any order. The effects of the 
presence and absence of computation parallelism across threads are clearly 
manifested in the plots. 

Fig. 5.2. Average number of switches for each processor. The x-axis is the 
number of threads. 

Iteration synchronization switching can be as high as remote read 
switching for the small problem size of 512K, as shown in Figure 5.2(a) and 
(c). As the number of threads reaches 16, the synchronization switch cost is 
in fact higher than the remote read switching cost. This is because the 
amount of computation is relatively small. After such small computations, 16 
threads check whether the other threads are done with the current iteration. 
In fact, the iteration switching cost increases logarithmically as the number 
of threads increases linearly. There is approximately an order of magnitude 
difference in the number of iteration synchronization switches. For the large 
problems shown in Figure 5.2(b) and (d), the amount of computation is 16 
times higher, which effectively eliminates the impact of the iteration 
synchronization switching cost. 

When the two problems are compared across different numbers of 
processors, the switching pattern changes. The remote read switch and 
iteration synchronization switch curves do not meet. Each processor now finds 
more computation, which separates the two curves. In fact, the switching cost 
no longer increases rapidly for P=64. The fluctuation for sorting with P=64 
again shows that sorting possesses an irregular computation and 
communication pattern compared to FFT. 



6. Conclusions 

Reducing communication time is key to obtaining high performance on 
distributed-memory multiprocessors. Multithreading aims at reducing com- 
munication time by overlapping communication with computation. This 
chapter has presented the internal working of multithreading through empir- 
ical studies. Specifically, we have used the 80-processor EM-X multithreaded 
distributed-memory machine to demonstrate how multithreading can help 
overlap communication with computation. 

Bitonic sorting and Fast Fourier Transform have been selected to test 
the multithreading capabilities of EM-X. The criteria for the problem 
selection have been the computation-to-communication ratio and the amount 
of thread parallelism. Bitonic sorting has been selected for its nearly 1-to-1 
computation-to-communication ratio and its small amount of thread 
computation parallelism. FFT has been selected because of its high 
computation-to-communication ratio and its large amount of thread computation 
parallelism. Both problems have been implemented on EM-X with blocked data 
and workload distribution strategies. Data sizes of up to 8M integers for 
sorting and 8M points for FFT have been used. 

Experimental results have presented two key observations. First, the 
maximum overlapping has occurred when the number of threads is two to four 
for both problems. Sorting has a run length of 12 clocks per thread, and 
therefore four threads have been found adequate to mask off the latency of 
20 to 40 clocks, or 1 to 2 microseconds. Larger numbers of threads have 
adversely affected the amount of overlapping due to an excessive number of 
switches. In particular, the iteration synchronization switch has been found 
to be the main cause of excessive synchronization cost among the switches. 
The run length of FFT is very large, hundreds of clocks, due to 
trigonometric function computations. This rather high run length has been 
found sufficient to effectively tolerate the latency of 20 to 40 clocks. 

Second, the ratio of computation to communication plays a critical role 
in tolerating latency. Bitonic sorting results have shown that the maximum 
overlap has reached approximately 35%. The overlap is low because bitonic 
sorting has a small absolute computation time and lacks thread computation 
parallelism, requiring thread synchronization. FFT, on the other hand, has 
shown over 95% communication overlapping due to its high 
computation-to-communication ratio and its large amount of both thread 
computation and communication parallelism. FFT threads can compute and 
communicate in any order within an iteration, requiring no thread 
synchronization. 

The study has indicated that fine-grain multithreading can hold a key to 
obtaining high performance on distributed-memory machines. The fact that 
multithreading can tolerate over 35% of the total communication time for 
sorting in the absence of computation parallelism clearly supports this 
premise. Problems which possess irregular computation behavior and moderate 
parallelism can be a logical target for obtaining high performance through 
multithreading. We believe it is a realistic goal to achieve high overlapping 
for such irregular problems if the thread scheduling and synchronization 
mechanisms are fine-tuned to thread computation and communication 
parallelism. Our next goal is to fine-tune the mechanisms for hardware thread 
scheduling and synchronization. 

Summary. For data-parallel languages such as High Performance Fortran to 
achieve wide acceptance, parallelizing compilers must be able to provide 
consistently high performance for a broad spectrum of scientific applications. 
Although compilation of regular data-parallel applications for message-passing 
systems has been widely studied, current state-of-the-art compilers implement 
only a small number of key optimizations, and the implementations generally 
focus on optimizing 
programs using a “case-based” approach. For these reasons, current compilers are 
unable to provide consistently high levels of performance. In this paper, we de- 
scribe techniques developed in the Rice dHPF compiler to address key code genera- 
tion challenges that arise in achieving high performance for regular applications on 
message-passing systems. We focus on techniques required to implement advanced 
optimizations and to achieve consistently high performance with existing optimiza- 
tions. Many of the core communication analysis and code generation algorithms in 
dHPF are expressed in terms of abstract equations manipulating integer sets. This 
approach enables general and yet simple implementations of sophisticated optimiza- 
tions, making it more practical to include a comprehensive set of optimizations in 
data-parallel compilers. It also enables the compiler to support much more aggres- 
sive computation partitioning algorithms than in previous compilers. We therefore 
believe this approach can provide higher and more consistent levels of performance 
than are available today. 



1. Introduction 

Data-parallel languages such as High-Performance Fortran (HPF) [29, 31] 
aim to make parallel scientific computing accessible to a much wider audi- 
ence by providing a simple, portable, abstract programming model applica- 
ble to a wide variety of parallel computing systems. For such languages to 
achieve wide acceptance, it will be essential to have parallelizing compilers 
that provide consistently high performance for a broad spectrum of scientific 
applications. To achieve the desired levels of performance and consistency, 
compilers must necessarily exploit a wide variety of optimization techniques 
and effectively apply them to programs with as few restrictions as possible. 

Engineering HPF compilers that provide consistently high performance 
for a wide range of programs is a challenging task. The data layout directives 
in an HPF program provide an abstract, high-level specification of maxi- 
mal data-parallelism and data-access locality. The compiler must use this 
information to choose how to partition the computation among processors, 
determine what data movement and synchronization is necessary, and gener- 
ate code to implement the partitioning, communication and synchronization. 
Accounting for interactions and feedback among these steps complicates pro- 
gram analysis and code generation. To achieve high efficiency, optimizations 
must analyze and transform the program globally within procedures, and 
often interprocedurally as well. 

The most widely studied sub-problem of data-parallel compilation is that 
of compiling “regular” data-parallel applications on message-passing systems. 
Data-parallel programs are known as “regular” if the mapping of each array’s 
data elements to processors can be described by an affine mapping function 
and the array sections accessed by each array reference can be computed 
symbolically at compile time. Even within this class of applications, state-of- 
the-art commercial and research compilers do not consistently achieve perfor- 
mance competitive with hand-written code [16,24]. Although many important 
optimizations for such systems have been proposed by previous researchers, 
current compilers implement only a small fraction of these optimizations, gen- 
erally focusing on the most fundamental ones such as static loop partitioning 
based on the “owner-computes” rule [39], moving messages out of loops, re- 
ducing the number of data copies, and exploiting collective communication. 
Furthermore, even for these optimizations, most research and commercial 
data-parallel compilers to date [7,10,15-17,19,24,32,33,35,42,45,46] (in- 
cluding the Rice Fortran 77D compiler [24]) perform communication analysis 
and code generation for specific combinations of the form of references, data 
layouts and computation partitionings. While such “case-based” approaches 
can provide excellent performance where they apply, they will provide poor 
performance for cases that have not been explicitly considered. More impor- 
tantly, case-based compilers require a relatively high development cost for 
each new optimization because the analysis and code generation for each 
case is handled separately; this makes it difficult to achieve wide coverage 
with optimizations, which in turn makes it difficult to offer consistently high 
performance. 

In this paper, we describe techniques to address key code generation chal- 
lenges that arise in a sophisticated compiler for regular data-parallel applica- 
tions on message-passing systems. We focus on techniques required to imple- 
ment advanced optimizations and to achieve consistently high performance 
with existing optimizations. With minor exceptions, these techniques have 
been implemented in the Rice dHPF compiler, an experimental research com- 
piler for High Performance Fortran. Although this paper focuses on compila- 
tion techniques for regular problems on message-passing systems, the dHPF 
compiler is being designed to integrate handling for regular and irregular 
applications, and to target other architectures including shared-memory and 
hybrid systems. 

The principal code generation challenges we address are the following: 

— Flexible computation partitionings: Higher performance can be achieved if 
compilers can go beyond the widely-used owner-computes rule to support a 
much more flexible class of computation partitionings. More general com- 
putation partitionings require two key compiler enhancements. First, they 
require robust communication analysis techniques that are not limited to 
specific partitioning assumptions. Second, they also require sophisticated 
code generation techniques to guarantee correctness in the presence of ar- 
bitrary control-flow, and to generate partitioned code with good scalar 
efficiency. The dHPF compiler supports a much more general computation 
partitioning (CP) model than in previous data-parallel compilers. We de- 
scribe the communication analysis and code generation techniques required 
to support this model. 

— Robust algorithms for communication and code generation: The core communication 
analysis, optimization, and code generation algorithms in dHPF 
are expressed in terms of abstract equations manipulating integer sets 
rather than as a collection of strategies for different cases. Optimizations we 
have formulated in this manner include message vectorization [14], message 
coalescing [43], recognizing in-place communication [1], code generation for 
our general CP model [1], non-local index set splitting [32], control-flow 
simplification [34], and generalized loop-peeling for improving parallelism. 
By formulating these algorithms in terms of operations on integer sets, 
we are able to abstract away the details of the CPs, references, and data 
layouts for each problem instance. All of these algorithms fully support 
our general computation partitioning model, and can be used for arbitrary 
combinations of computation partitionings, data layouts and affine refer- 
ence subscripts. 

— Simplifying compiler-generated control-flow: Loop splitting transformations, 
performed to minimize the dynamic cost of adding new guards, produce 
new loops with smaller iteration spaces, which can render existing 
guards and loops inside redundant or infeasible. This raises the need for an 
algorithm that can determine the symbolic constraints that hold at each 
control point and use them to simplify or eliminate branches and loops 
in the generated code. We motivate and briefly describe a powerful algo- 
rithm for constraint propagation and control flow simplification used in 
the dHPF compiler [34]. In a preliminary evaluation, the algorithm has 
proven highly effective at eliminating excess control-flow in the generated 
code. Furthermore, we find that the general purpose control-flow simpli- 
fication algorithm provides some or all of the benefits of special-purpose 
optimizations such as vector-message pipelining [43] and overlap areas [14]. 

Our aim in this chapter is to motivate and provide an overview of the 
techniques we use to address the challenges described above. The algo- 
rithms underlying these techniques are described and evaluated in detail 
elsewhere [1,34]. In the following section, we use an example to describe 
the basic steps in generating an explicit message-passing parallel program 
for HPF. We also use the example to show why integer sets are fundamen- 
tal to the problem of HPF compilation, and describe the approaches used 
in previous data-parallel compilers for computing and generating code from 
integer sets. In Section 3 we describe a general integer-set framework under- 
lying HPF compilation, and the implementation of this framework in dHPF. 
The framework directly supports integer set based algorithms for many of 
our optimizations, and these are briefly described in the subsequent sections. 
In Section 4, we define our general computation partitioning model and how 
we support code generation for it. In Section 5, we provide an overview of the 
principal communication optimizations in dHPF and then briefly describe the 
key optimizations that were formulated in terms of the integer set framework. 
In Section 6, we motivate and describe control-flow simplification in dHPF, 
and present a brief evaluation of its effectiveness. Finally, in Section 7, we 
conclude with a brief summary and discussion of the techniques described in 
this chapter. 



2. Background: The Code Generation Problem for HPF 

The High Performance Fortran standard describes a number of extensions to 
Fortran 90 to guide compiler parallelization for parallel systems. The language 
is discussed in some detail in an earlier chapter [29], and we assume the 
reader is familiar with the major aspects of the language (particularly, the 
data distribution directives). Throughout this paper, we assume a message- 
passing target system although many of the same analyses are required or 
profitable for shared-memory systems as well. 

2.1 Communication Analysis and Code Generation for HPF 

To understand the basic problem of compiling an HPF program into an 
explicitly parallel message-passing program, and to motivate our use of a 
general integer-set framework for analysis and code generation, consider the 
simple example in Figure 2.1. The source loop represents a nearest-neighbor 
stencil computation similar to those found in partial differential equation 
solvers. The two arrays are aligned with each other and both are distributed 
(block, block) on a two-dimensional processor array. To generate an explic- 
itly parallel code for the program, the compiler must first decide (a) how 
to partition the computation for each statement in the program, (b) which 
references might access non-local data due to the chosen partitioning, and 
(c) how and when to instantiate the communication to obtain this non-local 
data. 

Assume the compiler chooses an “owner-computes” partitioning for the 
statement in the loop, i.e., each instance of the statement is executed by the 
processors that own the value being computed, viz., A(i,j). In this case, each 
of the four references on the right hand side (RHS) accesses some off-processor 
elements, namely the boundary elements on the four neighboring processors. 
Since the array B is not modified inside the loop nest, communication for 
these references can be moved out of the loops and placed before the loop 
nest. 

CHPF$ processors P(3,3) 
CHPF$ distribute A(block, block) onto P 
CHPF$ distribute B(block, block) onto P 
      do 10 j=2,N-1 
      do 10 i=2,N-1 
        A(i,j) = 0.25*(B(i-1,j)+B(i+1,j)+B(i,j-1)+B(i,j+1)) 
 10   continue 

(a) HPF source code 

(b) Processor array P(3,3) 

(c) Communication and iteration sets (drawn on the 3 x 3 grid of processors P(1,1) through P(3,3)) 

Fig. 2.1. Example illustrating integer sets required for code generation for 
Jacobi kernel 

In order to generate efficient explicitly parallel SPMD code, the compiler 
must compute the following quantities, and then use these to generate code. 
These quantities are illustrated in Figure 2.1: 

1. the sections of each array allocated to (owned by) each processor; 

2. the set of iterations to be executed by each processor (conforming with 
the owned section of array A); 

3. the non-local data accessed from each other processor by each reference 
(the off-processor boundary sections shown in the figure); 

4. the iterations that access non-local data and the iterations that access 
exclusively local data. (These sets are used by advanced optimizations 
such as those described in Section 5.3.) 

All of these quantities can be symbolically represented as sets of integer 
tuples (representing array indices, loop iterations, or processor indices), or 
as mappings between integer tuples (e.g., an array layout is a mapping from 
processor indices to array indices). These sets and mappings are defined in 
Section 3. The sets may be non-convex, as is the set of iterations accessing 
non-local data shown in Figure 2.1. 
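
To make the four quantities concrete, the following sketch enumerates them by brute force for the Jacobi loop nest of Fig. 2.1. It is only an illustration: the array extent N = 9, the block-size formula, and all function names are assumptions chosen for this example, and a compiler would represent these sets symbolically rather than enumerate them.

```python
# Brute-force enumeration of the four quantities for the Jacobi loop nest of
# Fig. 2.1. N, the block-size formula and all names are assumptions.
N = 9                    # assumed extent of A and B (N x N)
P = 3                    # processors per dimension, as in P(3,3)
BLOCK = -(-N // P)       # ceil(N / P): assumed block size per dimension

# Offsets of the four right-hand-side references B(i-1,j), B(i+1,j), B(i,j-1), B(i,j+1).
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def owner(i, j):
    """Processor (1-based pair) owning element (i, j) under (block, block)."""
    return ((i - 1) // BLOCK + 1, (j - 1) // BLOCK + 1)


def owned_section(p):
    """Quantity 1: array indices owned by processor p."""
    return {(i, j) for i in range(1, N + 1) for j in range(1, N + 1)
            if owner(i, j) == p}


def iterations(p):
    """Quantity 2: iterations executed by p under owner-computes on A(i,j)."""
    return {(i, j) for i in range(2, N) for j in range(2, N)
            if owner(i, j) == p}


def nonlocal_data(p):
    """Quantity 3: for each other processor q, the elements of B that p reads from q."""
    result = {}
    for (i, j) in iterations(p):
        for (di, dj) in OFFSETS:
            elem = (i + di, j + dj)
            q = owner(*elem)
            if q != p:
                result.setdefault(q, set()).add(elem)
    return result


def split_iterations(p):
    """Quantity 4: (iterations touching only local data, iterations touching non-local data)."""
    its = iterations(p)
    nonlocal_its = {(i, j) for (i, j) in its
                    if any(owner(i + di, j + dj) != p for di, dj in OFFSETS)}
    return its - nonlocal_its, nonlocal_its
```

For the interior processor (2,2), the non-local iterations form a ring around the single fully local iteration (5,5), which is an instance of the non-convex set mentioned above.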

To generate a statically partitioned, message-passing program, any data- 
parallel compiler must implicitly or explicitly compute the above sets, and 
then use these sets to generate code. The compiler typically generates SPMD 
code for a representative processor myid by performing the following tasks 
in some order (the resulting code is omitted here): 

1. Synthesize a loop nest to execute the iterations assigned to myid. 

2. For each message, if explicit buffering of data is necessary, synthesize 
code to pack a buffer at the sending processor and/or to unpack a buffer 
at the receiving processor. 

3. Allocate storage to hold the non-local data, and modify the code to access 
data out of this storage (note that different references access non-local 
data in different iterations). 

4. Allocate storage for the local sections for each array, and modify array 
references (for local data) to index appropriately into these local sections. 
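
Tasks 1 and 4 can be sketched for the simplest case, a one-dimensional BLOCK distribution with a known block size. This is a hand-written illustration of the kind of bounds reduction and index translation involved, not the code dHPF generates; the function names and the 0-based numbering of myid are assumptions.

```python
def local_bounds(myid, lb, ub, block):
    """Bounds of the iterations of `do i = lb, ub` owned by processor myid.

    Assumes element i (1-based) of a BLOCK-distributed array lives on
    processor (i - 1) // block, with processors numbered from 0.
    The range is empty when the returned lower bound exceeds the upper bound.
    """
    lo = max(lb, myid * block + 1)
    hi = min(ub, (myid + 1) * block)
    return lo, hi


def to_local(i, myid, block):
    """Translate global index i into a 1-based index into myid's local section."""
    return i - myid * block


def spmd_loop(myid, lb, ub, block):
    """Return the (global index, local index) pairs visited by processor myid."""
    lo, hi = local_bounds(myid, lb, ub, block)
    return [(i, to_local(i, myid, block)) for i in range(lo, hi + 1)]
```

With block = 4 and the loop `do i = 2, 9`, processor 1 visits global indices 5 through 8, which it addresses as local indices 1 through 4.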



2.2 Previous Approaches to Communication Analysis and Code 
Generation 

To compute the above sets and to generate code using them, the primary 
approach in most previous research and commercial compilers has been to 
focus on individual common cases and to precompute the iteration and communication 
sets symbolically for specific forms of references, data layouts and 
computation partitionings [7,10,15-17,19,24,32,33,42,46]. For example, to 
implement the Kali compiler [32], the authors pre-compute symbolically the 
iteration sets and communication sets for subscripts of the form c, i + c and 
c0f, where f is a loop index variable for BLOCK and CYCLIC distributions. 
The Fortran 77D compiler also handled the same classes of references and 
distributions, but computed special-case expressions for the iteration sets for 
“interior” and “boundary” processors [24]. (In fact, both these groups have 
described basic compilation steps in terms of abstract set operations [23,32]; 
however, this was used only as a pedagogical abstraction and the correspond- 
ing compilers were implemented using case-based analysis.) Li and Chen de- 
scribe algorithms to classify communication caused by more general reference 
patterns (assuming aligned arrays and the owner-computes rule), and gen- 
erate code to realize these patterns efficiently on a target machine [33]. In 
general, these compilers focus on providing specific optimizations aimed at 
cases that are considered to be the most common and most important. The 
principal benefits of such case-based strategies are that they are conceptually 
simple and hence lend themselves well to initial implementations, they have 
predictable and fast running times, and they can provide excellent perfor- 
mance in cases where they apply. 

Three groups have used a more abstract and general approach based on 
linear inequalities to support code generation for communication and iter- 
ation sets [2-5]. In this approach, each code generation or communication 
optimization problem is described by a collection of linear inequalities repre- 
senting integer sets or mappings. Fourier-Motzkin elimination [41] is used to 
simplify the resulting inequalities, and to compute a range of values for indi- 
vidual index variables that together enumerate the integer points described 
by these inequalities. Code generation then amounts to generating loops to 
iterate over these index ranges. In the PIPS and Paradigm compilers, these 
techniques were primarily used for code generation for communication and 
iteration sets [3,5]. In the SUIF compiler, these techniques were also applied 
to carry out specific optimizations including message vectorization, message 
coalescing (limited to certain combinations of references) and redundant mes- 
sage elimination [2]. 
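
The linear-inequality approach can be illustrated with a small sketch of Fourier-Motzkin elimination followed by loop generation. The representation (each inequality as a pair `(a, b)` meaning a·x ≤ b), the function names, and the two-variable example are assumptions made for illustration. Note that plain Fourier-Motzkin elimination computes the rational projection, which is exact for this example but not for integer points in general.

```python
def eliminate(ineqs, v):
    """Project variable index v out of a system of inequalities a.x <= b."""
    keep, pos, neg = [], [], []
    for a, b in ineqs:
        (pos if a[v] > 0 else neg if a[v] < 0 else keep).append((a, b))
    for ap, bp in pos:
        for an, bn in neg:
            mp, mn = -an[v], ap[v]        # both multipliers are positive
            coeffs = tuple(mp * x + mn * y for x, y in zip(ap, an))
            keep.append((coeffs, mp * bp + mn * bn))   # coefficient of v is now 0
    return keep


def bounds(ineqs, v, fixed):
    """Integer range of x[v] once the variables in `fixed` (index -> value) are set.

    Every variable other than v with a non-zero coefficient must appear in `fixed`.
    """
    lo = hi = None
    for a, b in ineqs:
        if a[v] == 0:
            continue
        rest = b - sum(a[k] * val for k, val in fixed.items())
        if a[v] > 0:
            cand = rest // a[v]           # floor(rest / a[v])
            hi = cand if hi is None else min(hi, cand)
        else:
            cand = -(rest // -a[v])       # ceil(rest / a[v])
            lo = cand if lo is None else max(lo, cand)
    return lo, hi


def enumerate_points(ineqs):
    """Scan a bounded two-variable system with an outer x[0] loop and an inner x[1] loop."""
    outer = eliminate(ineqs, 1)
    lo0, hi0 = bounds(outer, 0, {})
    points = []
    for i in range(lo0, hi0 + 1):
        lo1, hi1 = bounds(ineqs, 1, {0: i})
        for j in range(lo1, hi1 + 1):
            points.append((i, j))
    return points


# Triangular set {(i, j) : 1 <= i, i <= j, j <= 10}.
TRIANGLE = [((-1, 0), -1), ((1, -1), 0), ((0, 1), 10)]
```

Eliminating j from `TRIANGLE` leaves 1 ≤ i ≤ 10 as the outer loop range, and the inner bounds i ≤ j ≤ 10 are read off the original system for each fixed i.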

The advantage of using linear inequalities over case-based approaches is 
that each optimization or code generation problem can be expressed and 
solved in abstract terms, independent of the specific forms of references, data 
layouts, and computation partitionings. Furthermore, Fourier-Motzkin elim- 
ination is applicable to arbitrary affine references and data layouts. The pri- 
mary limitation of linear-inequality based approaches in previous compilers 
is that they have limited their focus to problems that can be represented by 
the intersection of a single set of inequalities. This limited the scope of their 
techniques so that, for example, they would be unable to support our general 
computation partitioning model, coalescing communication for arbitrary 
affine references, or general loop-splitting into local and non-local iterations. 
We considered all of these capabilities to be important goals for the dHPF 
compiler. A second drawback of linear-inequality based approaches is that 
each problem or optimization is expressed directly in terms of large collec- 
tions of inequalities which must be constructed to represent operations such 
as intersections and compositions of sets and mappings. It appears much eas- 
ier and more intuitive to express complex optimizations directly in terms of 
sequences of abstract set operations on integer sets, as shown in [1]. 

There is also a large body of work on techniques to enumerate communication 
sets and iteration sets in the presence of cyclic(k) distributions 
(e.g., [12,17,28,35,45]). Compared to more general approaches based on 
integer sets or linear inequalities, these techniques likely provide more efficient 
support for cyclic(k) distributions, particularly when k > 1, but would 
be much less efficient for simpler distributions, and are much less general 
in the forms of references and computation partitionings they could handle. 
Our goal has been to base the dHPF compiler on a general analysis frame- 
work that provides good performance in the vast majority of common cases 
(within regular distributions), and requires such special-purpose techniques 
as infrequently as possible. Such techniques can be added as special-purpose 
optimizations in conjunction with the integer-set framework, but even in the 
absence of these techniques, we expect that the set framework itself will provide 
acceptably efficient support for cyclic(k) distributions. 

To summarize, we believe there are two essential advantages and one sig- 
nificant disadvantage of the more general and abstract approaches based on 
linear inequalities or integer-sets, compared with the case-based approaches. 
First, the former use general algorithms that handle the entire class of regular 
problems fairly well, whereas case-based approaches apply more narrowly and 
must fall back more often on inefficient techniques such as run-time resolution 
for cases they do not handle. In the absence of special-case algorithms, the 
general approaches are likely to provide much higher performance. Support 
for exploiting special-cases (e.g., for using collective communication primitives) 
can be added to the former if they provide substantial performance 
improvements, but they should be needed in very few cases. Second, the more 
abstract framework provided by linear inequalities (to some extent) and by 
integer sets (to a greater extent) greatly simplifies the compiler-writer’s task 
of implementing important optimizations that are generally applicable, and 
therefore makes it practical to achieve high performance for a wider class of 
programs. By combining both generality and simplicity, we believe an ap- 
proach such as that of using integer sets can provide higher and more consis- 
tent levels of performance than is available today. In contrast, the principal 
advantages of case-based approaches are that preliminary implementations 
can be simple, and that they typically have fast and predictable running 
times. For more general approaches, running time is the greatest concern 
since manipulation of linear inequalities and integer sets can be costly and 
unpredictable in difficult cases. This issue is discussed in more detail in later 
sections. 



3. An Integer Set Framework for Data-Parallel 
Compilation 

As discussed in the previous section, any compiler for a data-parallel lan- 
guage based on data distributions can be viewed as operating in terms of 
some fundamental sets of integer tuples and mappings between these sets. 
This formulation is made explicit in dHPF and in the SUIF compiler [2], and 
similar formulations have been discussed elsewhere [23,32]. The integer set 
framework in dHPF includes the representation of these primitive quantities 
as integer tuple sets and mappings, together with the operations to manipu- 
late them and generate code from them. The core optimizations in dHPF are 
implemented directly on this framework. This section explains the primitive 
components of the framework, and the implementation of the framework. The 
following sections describe the optimizations formulated using the framework. 



3.1 Primitive Components of the Framework 



An integer $k$-tuple is a point in $\mathbb{Z}^k$; a tuple space of rank $k$ is a subset of 
$\mathbb{Z}^k$. Any compiler for a data-parallel language based on data distributions 
operates primarily on three types of tuple spaces, and the three pairwise 
mappings between these tuple spaces [2,23,32]. These are:^1 



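$\mathit{data}_k$    index set of an array of rank $k$, $k \ge 0$ 
$\mathit{loop}_k$    iteration space of a loop nest of depth $k$, $k \ge 0$ 
$\mathit{proc}_k$    processor index space in a processor array of rank $k$, $k \ge 1$ 
$\mathit{Layout}$    $\{[p] \to [a] : p \in \mathit{proc}_n \text{ owns array element } a \in \mathit{data}_k\}$ 
$\mathit{Ref}$       $\{[i] \to [a] : i \in \mathit{loop}_k \text{ references array element } a \in \mathit{data}_k\}$ 
$\mathit{CPMap}$     $\{[p] \to [i] : p \in \mathit{proc}_n \text{ executes statement instance } i \in \mathit{loop}_k\}$ 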



Scalar quantities such as a “data set” for a scalar, or the “iteration set” 
for a statement not enclosed in any loop are handled uniformly within the 
framework as tuples of rank zero.^2 For example, the computation partitioning 
for a statement (outside any loop) assigned to processor P in a 1-D processor 
array would be represented as the mapping $\{[\,] \to [p] : p = P\}$. Hereafter, 
the terms “array” and “iterations of a statement” should be thought of as 
including scalars and outermost statements as well. Note that any mapping 
we require, including a mapping with domain of rank 0, will be invertible. 

^1 We use names with lower-case initial letters for tuple sets and upper-case letters 
for mappings respectively. 

^2 A set of rank 0, $\{[\,] : f(v_1, \ldots, v_n)\}$, should be interpreted as a boolean that 
takes the values true or false, depending on whether the constraints given by 
$f(v_1, \ldots, v_n)$ are satisfied. Here $v_1, \ldots, v_n$ are symbolic integer variables. 

Of these primitive sets and mappings, the sets loop and proc and the 
mappings Layout and Ref are constructed directly from the compiler’s in- 
termediate representation, and form the primary inputs for further analyses. 
These quantities are constructed from a powerful symbolic representation 
used in dHPF, namely global value numbering. A value number in dHPF is 
a handle for a symbolic expression tree. Value numbers are constructed from 
dataflow analysis of the program based on its Static Single Assignment (SSA) 
form [13], such that any two subexpressions that are known to have identical 
runtime values are assigned the same value number [21]. Their construction 
subsumes expression simplification, constant propagation, auxiliary induc- 
tion variable recognition, and computing range information for expressions 
of loop index variables. A value number can be reconstituted back into an 
equivalent code fragment that represents the value. 

Figure 3.1 illustrates simple examples of the primitive sets and mappings 
for an example HPF code fragment. For clarity, we use different set variables 
to denote points in different tuple spaces. The construction of the Layout 
mapping follows the two steps used to describe an array layout in HPF [31], 
namely, alignment of the array with a template and distribution of the tem- 
plate on a physical processor array (the template and processor array are 
each represented by a separate tuple space). The ON_HOME CP notation 
and construction of CPMap are described in Section 4. 

3.2 Implementation of the Framework 

Expressing optimizations in terms of this framework requires an integer set 
package that supports all of the key set and map operations including inter- 
section, union, difference, domain, range, composition of a map with another 
map or set, and projection to eliminate a variable from a map or set. We use 
the Omega library developed by Pugh et al. at the University of Maryland for 
this purpose [27]. The library operations use powerful algorithms based on 
Fourier-Motzkin elimination for manipulating integer tuple sets represented 
by Presburger formulae [37]. In particular, the library provides two key ca- 
pabilities: it supports a general class of integer set operations including set 
union, and it provides an algorithm to generate efficient code that enumerates 
points in a given sequence of iteration spaces associated with a sequence of 
statements in a loop [26]. (Appendix A describes this code generation capability.) 
These capabilities are an invaluable asset for implementing set-based 
versions of the core HPF compiler optimizations as well as enabling a variety 
of interesting new optimizations, described in later sections. 
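
The set and map operations listed above can be mimicked on small, explicitly enumerated relations, which is sometimes useful for checking a formulation by hand. The sketch below is an assumption-laden toy, not the Omega library's interface: it stores a map as a Python set of (input, output) pairs, whereas Omega manipulates symbolic Presburger formulae. Composition is written left to right here (apply the first map, then the second), which is a convention chosen for this sketch.

```python
def inverse(m):
    """Inverse of a map given as a set of (x, y) pairs."""
    return {(y, x) for (x, y) in m}


def compose(f, g):
    """Left-to-right composition: x -> z whenever f maps x to y and g maps y to z."""
    return {(x, z) for (x, y) in f for (y2, z) in g if y == y2}


def domain(m):
    """Set of inputs of a map."""
    return {x for (x, _) in m}


def range_of(m):
    """Set of outputs of a map."""
    return {y for (_, y) in m}


def restrict_range(m, s):
    """Keep only the pairs whose output lies in the set s."""
    return {(x, y) for (x, y) in m if y in s}


def project_out(s, k):
    """Eliminate component k from every tuple of the set s."""
    return {t[:k] + t[k + 1:] for t in s}
```

Intersection, union and difference need no helper, since Python's `&`, `|` and `-` already provide them on finite sets.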

    real A(0:99,100), B(100,100) 
    processors P(4) 
    template T(100,100) 
    align A(i,j) with T(i+1,j) 
    align B(i,j) with T(*,i) 
    distribute T(*,block) onto P 

    read(*), N 
    do i = 1, N 
      do j = 2, N+1 
        A(i,j) = B(j-1,i)   ! ON_HOME B(j-1,i) 
      enddo 
    enddo 

symbolic $N$ 

$\mathit{Align}_A = \{[a_1,a_2] \to [t_1,t_2] : t_1 = a_1 + 1 \wedge t_2 = a_2\}$ 

$\mathit{Align}_B = \{[b_1,b_2] \to [t_1,t_2] : t_2 = b_1\}$ 

$\mathit{Dist}_T = \{[t_1,t_2] \to [p] : 25p + 1 \le t_2 \le 25(p+1) \wedge 0 \le p \le 3\}$ 

$\mathit{Layout}_A = \mathit{Dist}_T^{-1} \circ \mathit{Align}_A^{-1}$ 
$\quad = \{[p] \to [a_1,a_2] : \max(25p+1, 1) \le a_2 \le \min(25p+25, 100) \wedge 0 \le a_1 \le 99\}$ 

$\mathit{Layout}_B = \mathit{Dist}_T^{-1} \circ \mathit{Align}_B^{-1}$ 
$\quad = \{[p] \to [b_1,b_2] : \max(25p+1, 1) \le b_1 \le \min(25p+25, 100) \wedge 1 \le b_2 \le 100\}$ 

$\mathit{loop} = \{[l_1,l_2] : 1 \le l_1 \le N \wedge 2 \le l_2 \le N+1\}$ 

$\mathit{CPRef} = \{[l_1,l_2] \to [b_1,b_2] : b_2 = l_1 \wedge b_1 = l_2 - 1\}$ 

$\mathit{CPMap} = \mathit{Layout}_B \circ \mathit{CPRef}^{-1} \cap_{\mathrm{range}} \mathit{loop}$ 
$\quad = \{[p] \to [l_1,l_2] : 1 \le l_1 \le \min(N, 100) \wedge \max(2, 25p+2) \le l_2 \le \min(N+1, 101, 25p+26)\}$ 

Fig. 3.1. Construction of primitive sets and mappings for an example program. 
$\mathit{Align}_A$, $\mathit{Align}_B$, and $\mathit{Dist}_T$ also include constraints for the array and 
template ranges, but these have been omitted here for brevity. 
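
As a sanity check on Fig. 3.1, the sketch below rebuilds CPMap for the example program by direct enumeration and compares it with the closed form, read here as 1 ≤ l1 ≤ min(N, 100) and max(2, 25p+2) ≤ l2 ≤ min(N+1, 101, 25p+26). The function names and the processor numbering 0 to 3 are assumptions made for this illustration; the sample values of N in the usage note are arbitrary.

```python
def layout_b(p):
    """Elements (b1, b2) of B owned by processor p.

    B(i,j) is aligned with T(*,i) and T is distributed (*,block) onto 4
    processors, so the first subscript of B is split into blocks of 25.
    """
    return {(b1, b2) for b1 in range(25 * p + 1, 25 * p + 26)
            for b2 in range(1, 101)}


def cpmap_enumerated(p, n):
    """Iterations (l1, l2) whose ON_HOME reference B(l2-1, l1) is owned by p."""
    owned = layout_b(p)
    return {(l1, l2) for l1 in range(1, n + 1) for l2 in range(2, n + 2)
            if (l2 - 1, l1) in owned}


def cpmap_closed_form(p, n):
    """The closed-form CPMap of Fig. 3.1, instantiated for a concrete N = n."""
    lo2 = max(2, 25 * p + 2)
    hi2 = min(n + 1, 101, 25 * p + 26)
    return {(l1, l2) for l1 in range(1, min(n, 100) + 1)
            for l2 in range(lo2, hi2 + 1)}
```

Calling both functions for each p in 0..3 and a few values of N (for example 30, 60 and 120) should give identical sets; the min and max terms in the closed form are what clip the iteration space to the array bounds when N exceeds 100.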



One potentially significant disadvantage of using such a general represen- 
tation is the compile-time cost of the algorithms used in Omega. In partic- 
ular, simplification of formulae in Presburger arithmetic can be extremely 
costly in the worst-case [36]. Pugh has shown, however, that when the un- 
derlying algorithms in Omega (for Fourier-Motzkin elimination) are applied 
to dependence analysis, the execution time is quite small even for complex 
constraints with coupled subscripts and also for synthetic problems known 
to cause poor performance [37]. These experimental results at least provide 
evidence that the basic techniques could be practical for use in a compiler. In 
dHPF, the Omega library has already proved to be a powerful tool for prototyping 
advanced optimizations based on the integer set framework. On small 
benchmarks, the compiler provides acceptably fast running times, for exam- 
ple, requiring about 3 minutes on a SparcStation-10 to compile the Spec92 
benchmark Tomcatv with all optimizations. Further evidence for a variety of 
real applications will be required to judge whether or not this technology for 
implementing the integer set framework will prove practical for commercial 
data-parallel compilers. The dHPF implementation will provide a testbed for 
developing this evidence. If this approach does not prove practical, it is still 
possible that a simpler and more efficient underlying set representation could 
be used to support the same abstract formulation of optimizations, but with 
some loss of precision. 

Another significant and fundamental limitation is that Presburger arith- 
metic is undecidable in the presence of multiplication. For this reason, the 
Omega library provides only limited support for handling multiplication, and 
in particular, cannot represent sets with an unknown (i.e., symbolic) stride. 
Most importantly (from the viewpoint of HPF compilation), such strided sets 
are required for any HPF distribution when the number of processors is not 
known at compile time, and for a cyclic(k) distribution with unknown k. We 
have extended our framework to permit these parameters to be symbolic, as 
described below. Symbolic strides also arise for a loop with a non-constant 
stride or a subscript expression with a non-constant coefficient, although we 
expect these to be rare in practice. These are not supported by our frame- 
work, and would have to fall back on more expensive run-time techniques 
such as a finite-state-machine approach for computing communication and 
iteration sets (for example, [28]), or an inspector-executor approach. 

To permit a symbolic number of processors or cyclic(k) distribution with 
symbolic k, we use a virtual processor (VP) model that naturally matches the 
semantics of templates in HPF [22]. The VP model uses a virtual processor 
array for each physical processor array, using template indices (i.e., ignor- 
ing the distribute directive) in dimensions where the block size or number 
of processors is unknown, but using physical processor indices in all other 
dimensions. Using physical processor indices where possible facilitates better 
analysis and improves the efficiency of generated code. All of the analyses 
described in the following sections operate unchanged on physical or virtual 
processor domains. During code generation for each specific problem (e.g., 
generating a partitioned loop), we add extra enclosing loops that iterate over 
the VPs that are owned by the relevant physical processor (e.g., the representative 
processor myid). For each problem, we use an additional optimization 
step (consisting of a few extra integer set equations) to compute the precise 
set of iterations required for these extra loops, and therefore to minimize 
the runtime overhead in the resulting code. The details of our extensions to 
handle a symbolic number of processors are given in [1]. 
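
The shape of the code produced under the VP model can be suggested with a small one-dimensional sketch. It is a simplified illustration under stated assumptions, not the scheme of [1]: a template of extent n is BLOCK-distributed over nprocs processors, nprocs is known only at run time, each template index serves as one virtual processor, and the statement being partitioned has the hypothetical partitioning ON_HOME A(i+1) in a loop `do i = lb, ub`. All names are invented for this example.

```python
def owned_vps(myid, n, nprocs):
    """Range of virtual processors (template indices) owned by physical processor myid."""
    block = -(-n // nprocs)                   # ceil(n / nprocs), computed at run time
    return myid * block + 1, min((myid + 1) * block, n)


def partitioned_iterations(myid, n, nprocs, lb, ub):
    """Iterations of `do i = lb, ub` that myid executes for ON_HOME A(i+1)."""
    vlo, vhi = owned_vps(myid, n, nprocs)
    # Extra optimization step: tighten the VP loop so that it visits only
    # VPs that have at least one iteration (VP v executes iteration v - 1).
    vlo, vhi = max(vlo, lb + 1), min(vhi, ub + 1)
    executed = []
    for v in range(vlo, vhi + 1):             # extra enclosing loop over owned VPs
        i = v - 1                             # the single iteration assigned to VP v
        executed.append(i)
    return executed
```

Because the analysis is expressed per virtual processor, nothing in it depends on nprocs; only the bounds of the enclosing VP loop are computed at run time.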



4. Computation Partitioning 

A computation partitioning (CP) for a statement is a precise specification 
of which processor or processors must execute each dynamic instance of the 
statement. The CPs chosen by the compiler play a fundamental role in deter- 
mining the performance of the resulting code. For the compiler to have the 
freedom to choose a partitioning well-suited to an application’s needs, the 
communication analysis and code generation phases of the compiler must be 
able to support a flexible class of computation partitionings. In this section, 
we describe the computation partitioning model provided by dHPF and the 
code generation framework used to support the model. The communication 
analysis techniques supporting the model are described in Section 5. 



4.1 Computation Partitioning Models 

Most research and commercial compilers for HPF to date primarily use 
the owner-computes rule [39] to partition a computation. This rule specifies 
that each value assigned by a statement is computed by the owner (i.e., 
the “home”) of the location being assigned the value, e.g., the left hand 
side (LHS) in an assignment. The owner-computes rule amounts to a simple 
heuristic choice for partitioning a computation. It is straightforward to show 
that this approach is not optimal in general [12]. An alternate partitioning 
strategy used by SUIF [2] and Barua, Kranz & Agarwal [6] requires a CP 
to be described by a single affine mapping of iterations to processors, and 
assigns a single CP to an entire loop iteration and not to individual state- 
ments in a loop. This strategy is also not optimal, because (in general) it 
does not permit optimal CPs to be chosen separately for different statements 
in a loop. 

A major goal of the dHPF compiler is to support more general computa- 
tion partitionings. Doing so requires new support from the compiler’s com- 
munication analysis and code-generation phases. Previous compilers based 
on the owner-computes rule have benefited from two key simplifying assump- 
tions: (1) communication patterns are defined by a single pair of LHS and 
right-hand-side (RHS) references, and (2) all communication is caused by 
reads of non-local data. The SUIF partitioning model also has the benefit 
that each communication is defined by a single reference and a single CP 
mapping (before coalescing communication), although write references can 
cause communication too. This model has the additional benefit that code 
generation is greatly simplified by having a common CP for all statements 
in a loop (which will become clear from the discussion in Section 4.2). None 
of these simplifying assumptions are true for the more general partitioning 
model used in dHPF. 

4.1.1 The Computation Partitioning Model in dHPF. The compu- 
tation partitioning model supported by dHPF combines and generalizes the 
features of both previous CP models described above (the owner-computes 
rule and the SUIF model). Below we describe the key features of the dHPF 
CP model including the implicit CP representation used by early phases in 
the compilation and conversion to an explicit CP representation required for 
communication analysis and code generation. In Section 4.2.1, we discuss 
code generation for these general CPs and in Section 5 we discuss the role of 
CPs in communication analysis. 

In dHPF, a computation partitioning for a statement can be specified as 
the set of owners of the locations accessed by one or more arbitrary data references. 
Every statement (including control flow statements) can be assigned 
a partitioning independent of every other statement, restricted only to pre- 
serving the semantics of the source program. For a statement S enclosed in 
a loop nest with iteration space i, the CP of S is specified by a union of one 
or more ON_HOME terms: 

$$CP(S) : \bigcup_{k=1}^{n} \text{ON\_HOME } A_k(f_k(i)) \qquad (4.1)$$ 

An individual term ON_HOME $A_k(f_k(i))$ specifies that the instance of S in 
iteration $i$ is to be executed by the processor(s) that own the array element(s) 
$A_k(f_k(i))$. This set of processors is uniquely specified by subscript vector $f_k(i)$ 
and the layout of array $A_k$ at that point in the execution of the program.^3 
This implicit representation of a computation partitioning supports arbitrary 
index expressions or any set of values in each index position in $f_k(i)$. A data 
reference $A_k(f_k(i))$ in an ON_HOME clause need not be a reference existing 
in the program. Even the variable $A_k$ and its corresponding data layout may 
be synthesized for representing a desired CP, though our implementation 
is restricted to legal HPF layouts. With this representation, the CP model 
permits specification of a very wide class of partitionings. 

While early analysis phases in the dHPF compiler use this implicit CP 
representation, communication analysis and code generation require that the 
CP for each statement is converted into an explicit mapping of type CPMap 
defined in Section 3.1. The integer set framework is used to construct this 
explicit mapping. This construction requires that each subscript expression 
in $f_k(i)$ be an affine expression of the index variables, $i$, with known constant 
coefficients, or a strided range specifiable by a triplet lb:ub:step with known 
constant step. We construct the explicit integer tuple mapping representing 
the CP for a statement as follows. 

k=n 

CPMapiS) = \ {Layout Ak loop. (4.2) 

feii 

^ In the presence of dynamic REALIGN and REDISTRIBUTE directives, we
assume that only a single known layout is possible for each reference in the
program. Multiple reaching layouts would require generating multi-version code
or assuming that the layout is unknown until run time (as done for inherited
layouts).
For each term ON_HOME $A_k(f_k(i))$ in $CP(S)$, the composition of its layout
and inverse reference maps results in a new map that specifies all possible 
iterations assigned to each processor by this CP term. We restrict the range 
of this map to the iteration space given by loop. Taking the union over all 
CP terms gives the full mapping of iterations to processors for statement 
S. $CPMap(S)$ specifies the processor assignment for the single instance of
statement S in loop iteration i. Figure 3.1 shows a simple example of the 
construction of CPMap. The mapping can be vectorized over the range of 
iterations of one or more enclosing loops to represent the combined processor 
assignment for the set of statement instances in those loop iterations. 
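
The construction can be illustrated by brute-force enumeration. The sketch below builds a processor-to-iterations mapping for a single ON_HOME term; it replaces the symbolic integer-set operations with explicit loops, and the BLOCK layout and the names `block_owner` and `cp_map` are assumptions made for the example.

```python
from collections import defaultdict

def block_owner(index, n, p):
    """Assumed 1-D BLOCK layout: owner of 1-based element `index`."""
    block = -(-n // p)              # ceiling(n / p)
    return (index - 1) // block

def cp_map(subscripts, loop_range, n, p):
    """Explicit mapping processor -> iterations it executes."""
    mapping = defaultdict(set)
    for i in loop_range:            # restrict to the iteration space "loop"
        for f in subscripts:        # union over all ON_HOME terms
            mapping[block_owner(f(i), n, p)].add(i)
    return dict(mapping)

# ON_HOME A(i+1) for i = 1..15, with A(1:16) BLOCK-distributed on 4 processors
m = cp_map([lambda i: i + 1], range(1, 16), 16, 4)
for proc in sorted(m):
    print(proc, sorted(m[proc]))
```

Each processor's iteration set is shifted by one relative to its data block, which is exactly the effect of composing the layout with the inverse of the reference `i -> i+1`.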

Careful assignment of CPs to control-flow related statements (namely, DO,
IF, and GOTO statements, as well as labeled branch targets) is necessary to 
preserve the semantics of the source program. In particular, a legal partition- 
ing must ensure that each statement in the program is reached by a superset 
of the processors that need to participate in its execution, as specified by its 
CP. The code generation phase will then ensure that the statement is exe- 
cuted by exactly the processors specified by the CP. The algorithms dHPF 
uses to select computation partitions and ensure legality are beyond the scope 
of this paper. In Section 4.2.1, we discuss the interaction between correctness 
constraints on CP assignments for control-flow related statements and code 
generation. 

To support dHPF’s general computation partitioning model, the com- 
munication analysis and code-generation phases in the compiler must fully 
support any legal partitioning. Supporting this partitioning model would be 
impractical using a case-based approach; the dHPF compiler’s representa- 
tion of computation partitionings using an abstract integer set framework 
has proven essential for making the required analysis and code generation 
capabilities practical. 

4.2 Code Generation to Realize Computation Partitions

A general CP model such as that in dHPF poses several challenges for static 
code generation. First, the code generator must ensure correctness in the 
presence of arbitrary structured and unstructured control flow, without sac- 
rificing available parallelism. Second, generating efficient parallel code for a 
loop nest containing multiple statements that potentially have different it- 
eration spaces is an intrinsically difficult problem. Previous compilers use 
simple approaches for code generation and do not solve this problem in its 
general form (as described briefly below), but Kelly, Pugh & Rosser have 
developed an aggressive algorithm for “multiple-mappings code generation” 
which directly tackles this problem [26]. A third and related difficulty, how- 
ever, is that good algorithms for generating efficient code (like that of Kelly, 
Pugh & Rosser) will be inherently expensive because of the potential com- 
plexity of the iteration spaces and the resulting code. Ensuring reasonable 
compile times requires an effective strategy to control the compile-time cost 
of applying such an algorithm while still producing as high quality code as
possible. A fourth problem (not directly related to a general CP model) is 
that static code generation techniques will not be useful for a code with ir- 
regular or complex partitionings. Such cases require runtime strategies such 
as the inspector-executor approach (e.g., [18,32,40]). However, regular and 
irregular partitionings may coexist in the same program, and perhaps even in 
a single loop nest. This raises the need for a flexible code generation frame- 
work that allows each part of the source program to be partitioned using the 
most efficient strategy applicable. 

Before describing the techniques used in dHPF to address these challenges, 
we briefly describe the strategies used to realize computation partitions in 
current compilers, and the limitations of these strategies in addressing these 
challenges. We begin with the second of the four problems described above 
because the approaches to addressing this problem have implications for the 
other problems as well. As in the rest of the paper, we focus on compile- 
time techniques that are applicable when the compiler can compute static 
(symbolic) mappings between processors and the data they own or commu- 
nicate. If a static mapping is not computable at compile time, alternative 
strategies that can be used include run-time resolution [39], the
inspector-executor approach, and run-time techniques for handling cyclic(k)
partitionings. The latter two approaches are described in other chapters within
this volume [38,40]. 

It is relatively straightforward to partition a simple loop nest that con- 
tains a single statement or a sequence of statements with the same CP. The 
loop bounds can be reduced so that each processor will execute a smaller it- 
eration space that contains only the statement instances the processor needs 
to execute. For loops containing multiple statements with different CPs, it 
is important that each processor execute as few guards as possible to deter- 
mine which statement instances it must execute in each iteration. Previous 
compilers, such as IBM’s pHPF compiler [16], rely on loop distribution to 
construct separate loop nests containing statements with identical CPs, so 
as to avoid the need for run-time guards. There are two drawbacks to using 
loop distribution in this manner. First, loop distribution may be impossible 
because of cyclic data dependences in the loop. In such cases, compilers add 
statement guards to implement the CPs and, except for Paradigm, don’t re- 
duce loop bounds. The Paradigm compiler reduces loop bounds to the convex 
hull of the iteration spaces of statements inside the loop, in order to reduce 
the number of guards executed [5]. Second, fragmenting a loop nest into a 
sequence of separate loops over individual statements can (a) significantly 
reduce reuse of cached values between statements, and (b) significantly in- 
crease the contribution of loop overhead to overall execution time. A loop 
fusion pass that attempts to recover cache locality and reduce loop bound 
overhead is possible, but complex. Both the IBM and the Portland Group 
HPF compilers use a loop fusion pass, but apply it only to the simplest cases,
namely conformant loops [10,16]. 
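
To make the contrast between statement guards and loop bounds reduction concrete, here is a small sketch for a single statement with CP ON_HOME A(i). It assumes a one-dimensional BLOCK layout; the function names are hypothetical and the guard count is tracked only to show the overhead being avoided.

```python
def owns(myid, index, n, p):
    """Ownership test under an assumed 1-D BLOCK layout."""
    block = -(-n // p)              # ceiling(n / p)
    return (index - 1) // block == myid

def guarded(myid, n, p):
    """Full loop with a per-iteration ownership guard."""
    executed, guards = [], 0
    for i in range(1, n + 1):
        guards += 1
        if owns(myid, i, n, p):
            executed.append(i)
    return executed, guards

def reduced(myid, n, p):
    """Loop bounds reduced to the local block; no guards needed."""
    block = -(-n // p)
    lo = myid * block + 1
    hi = min((myid + 1) * block, n)
    return list(range(lo, hi + 1)), 0

print(guarded(1, 10, 4))
print(reduced(1, 10, 4))
```

Both versions execute the same iterations on each processor, but the guarded form evaluates one ownership test per iteration of the original loop.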

Kelly, Pugh & Rosser describe an aggressive algorithm to generate efficient
code without relying on loop distribution, for a loop nest containing multiple 
statements with different iteration spaces [26]. Given a sequence of (possibly 
non-convex) iteration spaces, the algorithm, mmcodegen, synthesizes a code 
fragment to enumerate the points in the iteration spaces in lexicographic 
order, tiling the loops as necessary to lift guards out of one or more levels 
of loops. Thus, the algorithm provides one of the key capabilities required to 
support our general CP model. The algorithm is briefly described in Appendix 
A along with an example that highlights its capabilities. 
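
The idea of lifting guards by splitting a loop can be shown on a much simpler problem than the one mmcodegen solves. The sketch below handles only one loop level and convex (interval) iteration spaces; it is not Kelly, Pugh & Rosser's algorithm, and `split_loop` is a hypothetical helper.

```python
def split_loop(spaces):
    """spaces: dict statement -> (lo, hi), inclusive 1-D iteration ranges.
    Returns loop sections (lo, hi, statements) in increasing order such
    that the set of active statements is constant within each section,
    so no per-iteration guard is needed inside a section."""
    cuts = sorted({lo for lo, _ in spaces.values()} |
                  {hi + 1 for _, hi in spaces.values()})
    sections = []
    for start, nxt in zip(cuts, cuts[1:]):
        active = [s for s, (lo, hi) in spaces.items()
                  if lo <= start and nxt - 1 <= hi]
        if active:
            sections.append((start, nxt - 1, active))
    return sections

for section in split_loop({"S1": (1, 10), "S2": (3, 12)}):
    print(section)
```

Here the single loop over 1..12 becomes three guard-free loops: S1 alone, then S1 and S2 together, then S2 alone. Note that S1 and S2 are each cloned into two copies, which is the source of the label-matching issue for GOTOs discussed later.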

As mentioned earlier, one potential drawback of an algorithm like mm- 
codegen is that it can be costly for large loop nests with multiple iteration 
spaces. Because of the potential for achieving high performance, however, we 
use the algorithm in dHPF as one of the core techniques for supporting the 
general CP model, and develop a strategy to control compile-time cost while 
still producing high quality code. To our knowledge, this is the first use of 
their algorithm for code generation in a data-parallel compiler, and therefore 
these issues have not been addressed previously. 

The techniques used in previous compilers to partition code in the pres- 
ence of control-flow are not clearly described in the written literature. Some 
compilers simply replicate the execution of control-flow statements on all pro- 
cessors, which is a simple method to ensure correctness but can substantially 
reduce parallelism in loops containing control flow [19]. Many other compilers 
ignore control flow because loop bounds reduction and ownership guards will 
enforce the appropriate CP for enclosed statements. However, this approach 
sacrifices potential parallelism for the sake of simplifying code generation. In 
particular, by not enforcing an explicit CP for a block structured IF state- 
ment, all processors that enter the scope enclosing the IF will execute the 
test, enter the appropriate branch, and execute CP guards for the statements 
inside, even though some of the processors do not need to participate in the 
execution of the enclosed statements at all. 

4.2.1 Realizing Computation Partitions in dHPF. We have developed 
a hierarchical code generation framework to realize the general class of par- 
titionings that dHPF supports. Our approach is hierarchical because it oper- 
ates on nested block structured scopes, one scope at a time. (A scope refers 
to a sequence of statements immediately enclosed within a procedure, a DO 
loop, or a single branch of an IF statement.) Key benefits of the hierarchical 
code generation framework are (1) it supports partitioning of scopes that 
can contain arbitrary control flow, (2) it supports multiple code generation 
strategies for individual scopes so that each scope can be partitioned with 
the most appropriate strategy, and (3) it uses a two pass strategy that is 
effective at minimizing guard overhead without sacrificing compile-time
efficiency. Below, we first describe the hierarchical code generation framework,
and then describe strategies used in dHPF to generate code for a single scope
within this framework.

The hierarchical code generation framework. Briefly, the code generation 
framework in dHPF operates as follows. Each scope in the program is handled
independently. Code generation operates one scope at a time, visiting
scopes bottom-up (i.e., innermost scopes first). Although the framework
operates one scope at a time, any particular strategy can partition multiple
scopes in a single step if desired. At each scope that has not yet been parti- 
tioned, the framework attempts to apply a sequence of successively more gen- 
eral strategies until one successfully performs the partitioning for the scope. 

Control flow in the program is handled as follows. This discussion assumes 
that a pre-processing phase has transformed all DO loops into DO/ENDDO form 
and relocated all branch target labels to CONTINUE statements. The compiler
computes partitionings for control-flow statements so that all processors
that need to execute any statement reach it, but as few processors execute 
the control-flow statements as possible in order to maximize parallelism. In- 
formally, the computation partitioning for an IF statement or a DO loop must 
involve all of the processors that need to execute any statement that is tran- 
sitively control-dependent on it. For an IF statement, the union of the CPs 
of its control dependents gives a suitable CP. For a DO loop, the union
of the CPs of its control dependents gives a CP suitable for a single itera- 
tion; a suitable CP for the entire compound DO statement can be computed 
by vectorizing the iteration CP over the loop range. A GOTO statement is 
assigned the union of the CPs of all control-flow statements on which it is 
immediately control-dependent. Note that this will be a superset of the CPs 
of the statements following the branch target, i.e., any processors that need 
to execute the target statements will reach those statements (when the condition
controlling the branch is satisfied). Finally, a branch target label is assigned
the CP of the immediately enclosing scope, for reasons discussed below. 
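
The rules for IF statements and DO loops can be sketched with plain sets of processor numbers standing in for CPs. The BLOCK layout, the array sizes, and the helpers `if_cp` and `do_cp` are assumptions for illustration only; dHPF manipulates symbolic integer sets rather than enumerated ones.

```python
def block_owner(index, n, p):
    """Assumed 1-D BLOCK layout: owner of 1-based element `index`."""
    block = -(-n // p)              # ceiling(n / p)
    return (index - 1) // block

def if_cp(dependent_cps):
    """CP of an IF: union of the CPs of its control dependents."""
    return set().union(*dependent_cps)

def do_cp(iteration_cp, loop_range):
    """CP of a whole DO statement: the per-iteration CP
    vectorized (unioned) over the loop range."""
    procs = set()
    for i in loop_range:
        procs |= iteration_cp(i)
    return procs

n, p = 16, 4

def body_cp(i):
    # Two dependents: ON_HOME A(i) and ON_HOME A(i+1)
    return if_cp([{block_owner(i, n, p)}, {block_owner(i + 1, n, p)}])

loop_cp = do_cp(body_cp, range(1, 9))
print(sorted(loop_cp))
```

For the loop over i = 1..8, processor 3 owns none of the referenced elements, so it does not appear in the loop's CP and need not evaluate the loop bounds at all.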

In order to ensure correctness while preserving maximum parallelism in 
the presence of control-flow, the code generation framework simply has to 
ensure that the above assignment of CPs to control-flow related statements 
is correctly preserved. In particular, any particular strategy used to partition 
a scope must correctly enforce the CPs of all statements within the scope. The 
only subtlety is that reducing the bounds of a DO loop does not enforce the 
CP of the compound DO loop statement itself; that must be enforced when 
partitioning the scope enclosing the DO loop (otherwise, extra processors 
may evaluate the bounds of the DO loop). 

The above handling of partitioned control flow statements yields a key 
simplification of the framework, namely, that each block-structured scope can 
be handled independently, even in the presence of arbitrary control flow such 
as a GOTO that transfers control from one scope to another. In particular, 
the CPs assigned to individual GOTO statements are simply enforced during
code generation, independent of the location and CPs of their branch targets.
A significant difficulty that must be addressed is ensuring that GOTOs are matched
with the correct branch targets (and labels). This can be difficult because 
statements in different scopes and even statements in the same scope may 
be cloned into different numbers of copies. (Statements can be cloned by 
the tiling performed by mmcodegen to minimize guards, as shown in the 
example in Appendix A.) We ensure that branches and branch targets are 
matched, as follows. 

Fortran semantics dictate that a GOTO cannot branch from an outer to 
an inner scope; a GOTO can only branch within the same scope or to an outer 
scope. The CP assigned to a GOTO may cause it to be cloned into multiple 
copies. By assigning a labeled statement the CP of its enclosing scope, we 
ensure that a label appears in every instance of that scope (in particular, in 
a DO scope, every iteration of the enclosing scope will include the labeled 
CONTINUE). Since every GOTO branching to this label must come from 
the same or an inner scope, every cloned copy of the GOTO will be matched 
with exactly one copy of the labeled CONTINUE. Furthermore, the GOTOs 
that must match a particular copy of the label are exactly those that appear 
within the same instance of the scope enclosing the label. This allows us to
renumber the label definitions and any matching label references. The details 
of the renumbering scheme are described elsewhere [1]. 

The second key issue we address with the framework is controlling the 
compile-time cost of using expensive algorithms such as mmcodegen, while 
still ensuring that guard overhead is minimized. There are two features in 
the framework that ensure efficiency. First, we take advantage of the inde- 
pendent handling of scopes to apply mmcodegen independently one loop or 
one perfect loop nest at a time. This ensures that the iteration spaces in each 
invocation of mmcodegen are as simple as possible (though there can still be
multiple different iteration spaces). Second, the use of a bottom-up approach
greatly reduces the number of times mmcodegen is invoked, compared to a
top-down approach. The drawback, however, is that the top-down approach
could yield much more efficient code. This tradeoff between the bottom-up 
and top-down approaches arises as follows. 

When generating code for a DO scope, the loop’s iteration space is often 
split into multiple sections to enable guards to be lifted out of the loop. If 
we used a top-down scope traversal order for generating code, information 
about the bounds of the different sections could be passed downward during 
code generation and exploited in inner scopes. However, a top-down strategy 
would require many more applications of the partitioning algorithm than a 
bottom-up strategy. For example, for a triply nested loop in which each loop 
will be split into two sections by code generation, a top-down strategy would 
invoke the loop partitioning algorithm seven times. A bottom-up strategy 
would invoke it only three times. Because of the potentially high compile- 
time cost of the former, we use a bottom-up code generation strategy. 
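
The counts in this example follow from a simple recurrence, sketched below under the assumption that, in a top-down traversal, every section of an outer loop requires a separate partitioning of the loop nested inside it. The function names are hypothetical.

```python
def top_down_invocations(depth, sections=2):
    """One invocation for the outermost loop, then one per section
    of each enclosing loop at every deeper level."""
    return sum(sections ** level for level in range(depth))

def bottom_up_invocations(depth):
    """Each loop is partitioned exactly once, before it is cloned
    by the splitting of its enclosing loops."""
    return depth

print(top_down_invocations(3), bottom_up_invocations(3))
```

For a triply nested loop split into two sections per level, this gives 1 + 2 + 4 = 7 invocations top-down versus 3 bottom-up, and the gap grows exponentially with nesting depth.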
do i = 1, N
   S1(i)
   do j = 1, M
      S2(i,j)
      S3(i,j)
   enddo
enddo

$LoopCP_i() = CP_1(1:N) \cup LoopCP_j(1:N)$

$CP_1(i) = \{ [i] : \text{proc. myid owns } A_1(f_1(i)) \}$

$LoopCP_j(i) = CP_2(i, 1:M) \cup CP_3(i, 1:M)$

$CP_2(i,j) = \{ [i,j] : \text{proc. myid owns } A_2(f_2(i,j)) \}$

$CP_3(i,j) = \{ [i,j] : \text{proc. myid owns } A_3(f_3(i,j)) \}$


Fig. 4.1. Example showing iteration sets constructed for code generation. 



We use two techniques to ensure the quality of the generated code de- 
spite the trade-offs made above. First, an important optimization we apply 
when generating code for individual scopes is to exploit as much known con- 
textual information as possible about the enclosing scopes. Second, we use 
a powerful, global control-flow simplification phase as the last step of code 
generation, which further simplifies the control-flow in the resulting program. 
The control-flow simplification algorithm is described in Section 6. The use of
available contextual information during the bottom-up strategy is described 
below. Together, these achieve much or all of the benefit of a top-down code 
generation strategy in which full context information is available to code 
generation in inner scopes, but at a fraction of the cost. 

When generating code for a scope in the bottom-up strategy, we can 
assume that code generation in the enclosing scope will ensure that only the 
correct processors enter the current scope. For example, consider the loop 
nest in Fig. 4.1. Statements S1, S2, and S3 each have a simple computation
partition consisting of a single ON_HOME clause. $LoopCP_j(i)$ represents the
CP for the j loop, which consists of the union of the CPs of the statements
in its scope vectorized across the range of the j loop. Inside the j loop, we
assume that the constraints in $LoopCP_j(i)$ hold because these constraints
will have been enforced when partitioning the enclosing i loop. Any code
generation strategy used for the inner scope can exploit this information.
Similarly, $LoopCP_i()$ represents the CP for the i loop, which consists of the
union of the CPs of the statements it contains, vectorized across the range
of the i loop. We assume that the constraints in $LoopCP_i()$ are true when
generating code for the i loop. 

Realizing a CP for a single scope. The first step in the process for
partitioning a scope is to separate the statements in the scope into statement
groups, which are sequences of adjacent statements that have homogeneous 
computation partitions. Second, we use equation 4.2 to construct the explicit 
representation of the iteration space for each statement group according to 
its computation partitioning. Third, we use some available strategy to par- 
tition the computation in the scope. When multiple strategies are available, 
we currently apply them in a fixed sequence, stopping when one succeeds in 
partitioning the scope. This permits a fixed series of strategies to be tried 
(typically attempting specific optimizations and, if these fail, then finally
applying some general strategy). 
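
The first and third steps can be sketched as follows. CPs are represented here as opaque strings, and the strategy functions are hypothetical stand-ins that return `None` when they do not apply; neither reflects the data structures actually used in dHPF.

```python
from itertools import groupby

def statement_groups(stmts):
    """stmts: list of (name, cp). Groups *adjacent* statements with
    identical CPs; non-adjacent statements stay in separate groups."""
    return [(cp, [name for name, _ in grp])
            for cp, grp in groupby(stmts, key=lambda s: s[1])]

def partition_scope(scope, strategies):
    """Try strategies in a fixed order; stop at the first that succeeds."""
    for strategy in strategies:
        result = strategy(scope)
        if result is not None:
            return result
    raise ValueError("no strategy could partition the scope")

stmts = [("S1", "ON_HOME a(i)"), ("S2", "ON_HOME a(i)"),
         ("S3", "ON_HOME b(i+1)"), ("S4", "ON_HOME a(i)")]
groups = statement_groups(stmts)

def specialized(scope):             # applies only to single-group scopes
    return "specialized" if len(scope) == 1 else None

def general(scope):                 # always applicable fallback
    return "general"

print(groups)
print(partition_scope(groups, [specialized, general]))
```

S4 has the same CP as S1 and S2 but is not adjacent to them, so it forms its own group; merging it would require reordering statements, which grouping alone does not attempt.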

We currently support two strategies for partitioning a loop: bounds re- 
duction, or loop-splitting combined with bounds reduction for the individual 
loop sections generated by splitting (the latter is described in Section 5.3). 
For non-loop scopes (conditional branches and the outermost routine level),
we also use bounds reduction which reduces to inserting guards on the rel- 
evant statements. Two alternatives applicable to loops or statement groups 
with irregular data layouts or irregular references are under development, 
namely, runtime resolution and an inspector-executor strategy. 

To perform bounds reduction as part of the above strategies, we apply 
Kelly, Pugh & Rosser’s mmcodegen algorithm to the sequence of iteration 
spaces for the statement groups in a scope. Applying mmcodegen to the 
iteration spaces for statement groups reduces loop bounds as needed and 
lifts guards out of inner loops when statement groups with non-overlapping 
iteration spaces exist. This results in a code template with placeholders
representing the statement groups. Finally, we replace each of the
placeholders in the code template by a copy of the code for the correspond- 
ing statement group. When labels are present in the code for the statement 
groups, we renumber the labels to ensure unique numbers as discussed earlier. 

As an alternative to this base strategy for realizing the computation
partition for a scope, Section 5.3 describes a loop splitting transformation that
may be applied during code generation to any perfect loop nest that has no 
carried dependences. From the perspective of the computation partitioning 
code generation, this approach serves as an alternate partitioning method 
which subdivides the iteration space for a DO into a sequence of iteration 
spaces, and then generates code for each with the method described above 
for a single scope. The purpose of the splitting transformation is described 
in Section 5.3.

Another much more specialized strategy we expect to add for loop nests 
is a code transformation for coarse-grain pipelining. The transformation si- 
multaneously performs strip-mining of one or more non-partitioned loops 
and loop bounds reduction for partitioned loops. The last two optimizations 
(loop-splitting and pipelining) illustrate that the hierarchical code genera- 
tion framework provides a natural setting within which to perform any code 
transformation that has the side effect of producing partitioned code, i.e., 
realizing the CPs assigned to statements in a scope. 



5. Communication Code Generation 

On message-passing systems, the most efficient communication is obtained 
when the compiler can statically compute the set of data that needs to be 
exchanged between processors to satisfy each non-local reference. For refer- 
ences with statically analyzable communication requirements, a data-parallel 
compiler must compute the set of non-local data to be communicated for
each non-local reference, and then use these sets to generate efficient code to 
pack, communicate, unpack, and access the non-local data. In this section, we 
describe implementation techniques for several key communication optimiza- 
tions used in the dHPF compiler to synthesize high performance communi- 
cation code for regular applications. Many of these techniques are based on 
the integer set framework. For references with unanalyzable communication 
requirements, typically due to non-affine subscripts, runtime techniques such
as the inspector-executor model must be used to manage communication. For 
more information on such techniques, we refer the reader to Chapter 21 and 
the references therein. 

The dHPF compiler includes a comprehensive set of communication opti- 
mizations that have been identified as important for high performance on 
message-passing systems. The benefits obtained from these optimizations 
(with very few exceptions) can vary widely between applications as well as 
between different systems. This implies that a compiler may have to incor- 
porate many different optimizations to obtain consistently high performance 
across large classes of applications and systems. Previous commercial and 
research compilers, however, have generally implemented only a few of these 
techniques because of the significant implementation effort entailed in each 
case. The important communication optimizations in the dHPF compiler in- 
clude the following. 

Optimizations to Reduce Message Overhead 

— Message vectorization moves communication out of loops in order to re- 
place element-wise communication with fewer but larger messages. This 
is implemented by virtually all data-parallel compilers, but in case-based 
compilers it is usually restricted to specific reference patterns for which 
the compiler can derive (or conservatively approximate) the data sets to 
be communicated [16,24,32]. 

— Exploiting collective communication is essential for achieving good speedup 
in important cases such as reductions, broadcasts, and array redistribu- 
tion [33]. On certain systems, collective communication primitives may 
also provide significant benefits for other patterns such as shift commu- 
nication. The important patterns (particularly reductions and broadcast) 
have been supported in most data-parallel compilers. 

— Message coalescing combines messages for multiple non-local references to 
the same or different variables, in order to reduce the total number of mes- 
sages and to eliminate redundant communication. Previous implementa- 
tions in Fortran 77D [24], SUIF [2], Paradigm [5], and IBM’s pHPF [11,16] 
have some significant limitations. In particular, coalescing can produce 
fairly complex data sets from the union of data sets for individual refer- 
ences. The previous implementations are limited to cases where the com- 
bined data sets are representable with (or can be approximated by) regular 
sections (in Fortran 77D, Paradigm and pHPF) or a single collection of
inequalities (in SUIF).

— Coarse-grain pipelining trades off parallelism to reduce communication 
overhead in loop nests with loop-carried, cross-processor data dependences. 
It is an important optimization for effectively implementing parallelism in 
such loop nests because the only alternative may be to perform a full array 
redistribution, which can be much more expensive. To our knowledge, this 
optimization has been implemented in a few research compilers [5,24] and 
one commercial one [16]. 
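
Message vectorization and coalescing can be illustrated together on a small case. The sketch below assumes a one-dimensional BLOCK layout, an owner-computes loop over the local block, and right-hand-side references A(i+1) and A(i+2); it enumerates the non-local elements explicitly, whereas a compiler would derive them symbolically. All names are hypothetical.

```python
from collections import defaultdict

def block_owner(index, n, p):
    """Assumed 1-D BLOCK layout: owner of 1-based element `index`."""
    block = -(-n // p)              # ceiling(n / p)
    return (index - 1) // block

def coalesced_messages(myid, offsets, n, p):
    """For references A(i+off) in a loop over the local block, return
    {source processor: sorted non-local elements}: one message per
    source, with duplicates across references removed."""
    block = -(-n // p)
    lo, hi = myid * block + 1, min((myid + 1) * block, n)
    needed = defaultdict(set)
    for i in range(lo, hi + 1):
        for off in offsets:
            idx = i + off
            if 1 <= idx <= n and block_owner(idx, n, p) != myid:
                needed[block_owner(idx, n, p)].add(idx)
    return {src: sorted(elems) for src, elems in needed.items()}

print(coalesced_messages(1, [1, 2], 16, 4))
```

Without these optimizations, processor 1 would receive three element-wise messages (element 9 twice and element 10 once); after vectorization and coalescing it receives a single two-element message from processor 2.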

Optimizations to Overlap Communication with Computation 

— Dataflow-based communication placement attempts to hide the latency of 
communication by placing message sends early and receives late so as to 
overlap messages with unrelated computation. A few compilers including 
IBM’s pHPF, SUIF, Paradigm, and Fortran 77D have used dataflow tech- 
niques to overlap communication in this manner. 

— Communication overlap via non-local index set splitting attempts to over- 
lap communication from a given loop nest with the local iterations of the 
same loop nest. This overlap generally cannot be achieved by the above 
dataflow placement techniques. Non-local index set splitting (or loop split- 
ting) separates iterations that access non-local data from those that access 
only local data. Communication can be overlapped with the local itera- 
tions by first executing send operations for the non-local data required in 
the loop, then the local iterations, then the receives, and finally the
non-local iterations. Loop splitting was implemented in Kali [32], albeit with
significant limitations as described in Section 5.3. 
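
A minimal sketch of non-local index set splitting follows, again assuming a one-dimensional BLOCK layout and references of the form A(i+off) in a loop over the local block. The helper `split_iterations` is hypothetical, and the comments at the end only outline the schedule described above.

```python
def split_iterations(myid, offsets, n, p):
    """Separate the local block's iterations into those whose
    references are all local and those touching non-local data."""
    block = -(-n // p)              # ceiling(n / p)
    lo, hi = myid * block + 1, min((myid + 1) * block, n)
    local_iters, nonlocal_iters = [], []
    for i in range(lo, hi + 1):
        refs = [i + off for off in offsets if 1 <= i + off <= n]
        if all(lo <= r <= hi for r in refs):
            local_iters.append(i)
        else:
            nonlocal_iters.append(i)
    return local_iters, nonlocal_iters

local_iters, nonlocal_iters = split_iterations(1, [1, 2], 16, 4)
print(local_iters, nonlocal_iters)

# Resulting schedule on each processor:
#   1. post sends for the data other processors need
#   2. execute local_iters       (overlaps with communication)
#   3. complete the receives
#   4. execute nonlocal_iters
```

Because the local iterations reference only owned elements, they also need no buffer access checks, which is the second benefit of the same transformation.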

Optimizations to Minimize Data Buffering and Access Costs 

— Minimizing buffer copying overhead is essential to minimize the overall cost 
of communication. This can be achieved in multiple ways. First, in most 
message-passing implementations, when the data to be sent or received 
is contiguous in memory, it can be communicated “in-place” rather than 
copied to or from message buffers. Second, asynchronous send and receive 
primitives can be used to avoid additional buffer copies between user and 
system buffers by making user-level buffers available for the duration of 
communication. Third, non-local data received into a buffer can in some 
cases be directly referenced out of the buffer (if the indexing functions can 
be generated by the compiler), thus avoiding an unpacking operation. All 
of these techniques appear to be widely used in data-parallel compilers, 
though the effectiveness of the implementations may vary. 

— Minimizing buffer access checks via non-local index-set splitting. Access 
checks (i.e., ownership tests) are required when the same reference may 
access local data from an array or non-local data from a separate buffer on 
different loop iterations. Loop-splitting separates out the local iterations 
(which are guaranteed to access local data) from the non-local ones. Even 
the latter may not need access checks if all non-local references now access
only non-local data. The alternative to this transformation is to copy local 
and non-local data into a common buffer (as done in the IBM pHPF com- 
piler [16]), which can be costly in time and memory usage. As mentioned 
above, non-local index set splitting was implemented in a limited form in 
Kali. 

— Overlap areas for shift communication are extra boundary elements added
to the local sections of arrays involved in shift communication [14]. They 
permit local and non-local data to be referenced uniformly, thus avoiding 
the need for the access checks (or alternatives) mentioned above. Generally, 
interprocedural analysis has to be used to determine the size of required 
overlap areas globally for each array. Simpler implementations may waste 
significant memory and may have to be controlled by the programmer. 
Overlap areas have been implemented in several research and commercial 
compilers. 
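
For a single array and a single loop, the size of an overlap area follows directly from the shift offsets. The sketch below assumes a one-dimensional BLOCK layout and constant offsets; the function names are hypothetical, and a real compiler would combine such widths across all references to the array, typically interprocedurally.

```python
def overlap_widths(offsets):
    """Boundary elements needed below and above the local block
    for references A(i+off) with the given constant offsets."""
    return max(0, -min(offsets)), max(0, max(offsets))

def extended_section(myid, offsets, n, p):
    """Local block of an assumed 1-D BLOCK layout, extended by the
    overlap area and clipped to the array bounds 1..n."""
    block = -(-n // p)              # ceiling(n / p)
    lo, hi = myid * block + 1, min((myid + 1) * block, n)
    left, right = overlap_widths(offsets)
    return max(1, lo - left), min(n, hi + right)

print(overlap_widths([-1, 1, 2]))
print(extended_section(1, [-1, 1, 2], 16, 4))
```

Once received data is stored in the extended section, the reference A(i+off) uses the same indexing for local and non-local elements, so no ownership test is needed at the access.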

The Rice dHPF compiler implements all of the above optimizations ex- 
cept coarse-grain pipelining and the use of asynchronous message primitives 
(both these are currently being implemented). Some other specific communi- 
cation optimizations have been implemented in other compilers but are not 
included in dHPF. IBM’s pHPF coalesces diagonal shift communication into 
two messages [16], whereas dHPF requires three. This is useful, for example, 
in stencil computations that access diagonal neighbors such as a nine-point 
stencil over a two-dimensional array. Chakrabarti et al. describe a powerful 
communication placement algorithm that can be used to maximize opportunities
for message coalescing or to balance message coalescing with communication
overlap [11]. SUIF uses array dataflow analysis to communicate data
directly from a processor executing a non-local write to the next processor 
executing a non-local read [2], whereas dHPF must use an extra message to 
send the data first to the owner and from there to the reader. The former 
two optimizations can be directly added to the current implementation of 
dHPF. The SUIF model is a different and significantly more complex com- 
munication model compared to that used in dHPF, and there is little evidence 
available so far to evaluate whether the additional complexity is justified for 
message-passing systems. 

One reason that it has been practical to implement a fairly large collection 
of advanced optimizations in dHPF is our use of the integer set framework. 
By formulating optimizations abstractly in terms of integer set operations, 
we have obtained simple, concise, and general implementations of some of 
the most important phases of the compiler (such as communication code 
generation) as well as of complex optimizations like loop-splitting. These im- 
plementations broadly apply to arbitrary combinations of affine references, 
data distributions, and computation partitionings, because the analysis is not 
dependent on specific forms of these parameters. In the remainder of this sec- 
tion, we briefly describe the implementation of communication optimizations 
that use the integer set framework. These include our entire communication
generation phase which incorporates message vectorization and coalescing, 
the two optimizations based on non-local index set splitting and an algo- 
rithm for recognizing in-place communication. A control-flow simplification 
algorithm, which is also implemented using integer sets, is described in Sec- 
tion 6. 

5.1 Communication Generation with Message Vectorization and 
Coalescing 

The communication insertion steps in dHPF can be classified into two phases: 
a preliminary decision-making phase that identifies and places the required 
communication in the program, and a communication generation phase that 
computes the sets of processors and data involved in each communication, and 
synthesizes code to carry out the communication. In this paper, we primarily 
focus on the communication generation phase which is based on the integer 
set framework. We briefly describe the decisions made in the former phase, 
since these directly feed in as inputs to communication generation. 

The preliminary communication analysis steps in dHPF determine (a) 
which references are potentially “non-local”, i.e., might access non-local data, 
(b) where to place communication for each reference, (c) whether to use a 
collective communication primitive in each case, and (d) when communica- 
tion for multiple references can be combined. The first step is a very simple 
analysis to filter out references that can easily be proven to access only lo- 
cal data. The second step uses a combination of dependence and dataflow 
analysis to choose the placement of communication so as to determine how 
far each message can be vectorized out of enclosing loops, and to optionally 
move communication calls early or late to hide communication latency [30]. 
The third step uses the algorithms of Li and Chen [33] to determine if
specialized collective communication primitives such as a broadcast could be 
exploited. (Reductions are recognized using separate algorithms.) Otherwise, 
the compiler directly implements the communication using pairwise point-to- 
point communication. The fourth step chooses references whose communica- 
tion can be combined. Any two references whose communication involves one 
or more common pairs of processors can be coalesced in our implementation. 
In practice, however, it is usually not beneficial to combine references that
should use different communication primitives, such as a broadcast with any 
pairwise point-to-point communication. (One instance where it is profitable 
is combining a reduction and a broadcast by using a special reduction primitive
like MPI_Allreduce, which leaves every processor involved with a copy
of the result.) We refer to the entire collection of messages required for a set 
of coalesced references as a single logical communication event. 
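
The coalescing rule above can be illustrated with a minimal sketch. It is not the dHPF implementation: the dictionary layout, the function names, and the sample shift data are hypothetical, and the same-primitive requirement is a simplification of the remark that mixing primitives is usually not beneficial.

```python
def processor_pairs(sends):
    """Return the (sender, receiver) pairs that exchange at least one element."""
    return {pair for pair, data in sends.items() if data}


def can_coalesce(ref_a, ref_b):
    """Coalescing test: a shared processor pair and the same primitive.

    The same-primitive requirement is this sketch's simplification, not a
    rule stated by the text.
    """
    shared = processor_pairs(ref_a["sends"]) & processor_pairs(ref_b["sends"])
    return bool(shared) and ref_a["primitive"] == ref_b["primitive"]


def coalesce(ref_a, ref_b):
    """Union the per-pair data sets into one logical communication event."""
    pairs = processor_pairs(ref_a["sends"]) | processor_pairs(ref_b["sends"])
    return {
        pair: ref_a["sends"].get(pair, set()) | ref_b["sends"].get(pair, set())
        for pair in pairs
    }


def shift_sends(width, nprocs=3, block=20):
    """Hypothetical data: processor p sends its last `width` elements to p + 1."""
    return {
        (p, p + 1): set(range(block * (p + 1) - width + 1, block * (p + 1) + 1))
        for p in range(nprocs - 1)
    }


ref1 = {"primitive": "point-to-point", "sends": shift_sends(1)}   # e.g. B(i-1)
ref2 = {"primitive": "point-to-point", "sends": shift_sends(2)}   # e.g. B(i-2)
bcast = {"primitive": "broadcast", "sends": {(0, 1): {1}, (0, 2): {1}}}

event = coalesce(ref1, ref2) if can_coalesce(ref1, ref2) else None
```

Here the two shift references share both processor pairs, so one message per pair carries the union of their data.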

The code generation phase must then use the results of the previous phases 
to synthesize vectorized and coalesced messages that implement the desired 
communication for each logical communication event. For each reference, the 

DataAccessedMap =
  { [p1,p2] -> [b1,b2] :
      max(1, 20*p1) <= b1 <= min(20*p1+19, 58) &&
      max(2, 20*p2+1) <= b2 <= min(20*p2+20, 59) };

nlDataAccessed([m1,m2]) =
  { [b1,b2] :
      1 <= m1 <= 2 && b1 = 20*m1 &&
      max(2, 20*m2+1) <= b2 <= min(20*m2+20, 59) };

SendCommMap([m1,m2]) =
  { [p1,p2] -> [b1,b2] :
      p1 = m1+1 && p2 = m2 && 0 <= m1 <= 1 && b1 = 20*m1+20 &&
      max(2, 20*m2+1) <= b2 <= min(20*m2+20, 59) };

RecvCommMap([m1,m2]) =
  { [p1,p2] -> [b1,b2] :
      p1 = m1-1 && p2 = m2 && 1 <= m1 <= 2 && b1 = 20*m1 &&
      max(2, 20*m2+1) <= b2 <= min(20*m2+20, 59) };

Fig. 5.1. Example maps for communication due to reference B(i-1,j) in
the Jacobi kernel of Figure 2.1, assuming N = 60



compiler first computes the set of data to send between pairs (or groups) of 
processors; these communication sets depend on the reference, layout, com- 
putation partitioning, and the loop level at which vectorized communication 
is to be performed. Message coalescing requires computing the union of the 
above communication sets for the coalesced references. We directly compute 
the communication sets for each communication event using a sequence of 
integer set operations, independent of the specific form of the reference, lay- 
out, and computation partitioning. We then generate code from these sets 
directly. 

The integer set equations used to compute the communication sets for 
each logical communication event are described in detail elsewhere [1]. We 
briefly describe the key aspects of the algorithm here. The goal of the algo- 
rithm is to compute two separate maps for a fixed symbolic processor index 
m (where m is the index tuple for processor myid in the processor array to 
which the data is mapped, and myid is the representative processor index of 
the SPMD program). 

SendCommMap(m) = [p] -> [a] : array elements a that
                 proc. m must send to proc. p

RecvCommMap(m) = [p] -> [a] : array elements a that
                 proc. m must receive from proc. p
To illustrate how these maps are computed, Figure 5.1 shows some of the
intermediate results and the final resulting maps for a single non-local
reference B(i-1,j) in the Jacobi kernel example of Figure 2.1, assuming
N = 60. (We chose this very simple example to make it easy to understand
the maps, although it does not illustrate the generality of our integer
set formulation.) Here, the final SendCommMap(m) specifies that processor
myid (whose index tuple is $m = [m_1, m_2]$) must send the boundary
values $B(20m_1 + 20,\ 20m_2 + 1 : 20m_2 + 20)$ to its right neighbor,
$p : p_1 = m_1 + 1,\ p_2 = m_2$, except that processors with $m_1 = 2$ do not send
any data. RecvCommMap(m) specifies that processor myid must receive
the boundary values $B(20m_1,\ 20m_2 + 1 : 20m_2 + 20)$ from its left neighbor,
$p : p_1 = m_1 - 1,\ p_2 = m_2$, except processors with $m_1 = 0$ do not receive any
data. The min and max terms in the maps exclude the communication of
edge elements that are not accessed. The key steps used to compute these 
maps are as follows. 

For a given reference, we first compute a map, DataAccessedMap, describing
the entire set of data (local and non-local) accessed by each processor
p in all iterations of the loops out of which communication has been vectorized.
This is done by composing the computation partitioning and reference
maps (see Figure 5.1). Then, for a read reference, the set of non-local data
accessed by the processor m (denoted nlDataAccessed(m)) is the difference
between the data accessed and the data owned by that processor. For a write 
reference, the set of non-local data accessed is the intersection of the data ac- 
cessed by m and the data owned by all other processors p, since a write must 
update all owners of the data. (In the absence of replicated data, this step 
would be equivalent for reads and writes.) Now, the data that m must receive 
from each processor p is the intersection of the non-local data accessed by m 
and the data owned by p. The data that m must send to each processor p is 
the intersection of the non-local data accessed by p and the data owned by 
m. For a single reference, the last two results are exactly RecvCommMap(m) 
and SendCommMap(m) respectively. To coalesce communication for multiple
non-local references (including both reads and writes), we simply take the
union of these maps over all coalesced references. 
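
The set operations described above can be checked by brute-force enumeration on the example of Figure 5.1. The following is a minimal sketch, not the symbolic integer-set computation used in dHPF. It assumes a 3 x 3 processor grid, a (BLOCK, BLOCK) layout with 20 x 20 blocks, a loop nest over i, j = 2..N-1, and an owner-computes CP on an array aligned with B; all function names are hypothetical.

```python
from itertools import product

N, BLOCK, P = 60, 20, 3      # array extent, block size, processors per dimension
PROCS = list(product(range(P), repeat=2))


def owned(p):
    """Elements of B owned by processor p under a (BLOCK, BLOCK) layout."""
    rows = range(BLOCK * p[0] + 1, BLOCK * p[0] + BLOCK + 1)
    cols = range(BLOCK * p[1] + 1, BLOCK * p[1] + BLOCK + 1)
    return set(product(rows, cols))


def cp_iterations(p):
    """Iterations (i, j) of the 2..N-1 loop nest assigned to p (owner computes)."""
    return {(i, j) for (i, j) in owned(p) if 2 <= i <= N - 1 and 2 <= j <= N - 1}


def data_accessed(p):
    """Compose the computation partitioning with the reference map of B(i-1, j)."""
    return {(i - 1, j) for (i, j) in cp_iterations(p)}


def nl_data_accessed(m):
    """Read reference: data accessed by m minus data owned by m."""
    return data_accessed(m) - owned(m)


def recv_comm_map(m):
    """Map each p to the elements m must receive from p (empty entries dropped)."""
    pieces = {p: nl_data_accessed(m) & owned(p) for p in PROCS if p != m}
    return {p: data for p, data in pieces.items() if data}


def send_comm_map(m):
    """Map each p to the elements m must send to p (empty entries dropped)."""
    pieces = {p: nl_data_accessed(p) & owned(m) for p in PROCS if p != m}
    return {p: data for p, data in pieces.items() if data}
```

For example, `recv_comm_map((1, 1))` contains a single entry for the left neighbor (0, 1) holding row 20, columns 21 to 40, which agrees with RecvCommMap in Figure 5.1.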

If any of the array elements accessed by a read reference is replicated, a 
simple additional step is necessary to ensure that only a single owner sends 
the element to each processor that reads it. Similarly, if the CP for a write 
reference is replicated, a similar step ensures that only one writer sends the 
data back to each owner. To avoid communication bottlenecks, we ensure 
that all the owners (or writers) participate by providing data to different 
groups of destination processors [1]. 

For the case of coarse-grain pipelining, we use another additional step 
to account for the blocking of communication (i.e., the granularity of the 
pipeline). Specifically, in the range of the above maps, we extend the dimension
to be blocked from a single value to a range, using symbolic bounds to
represent the block of data communicated in each pipeline stage. 

SendCommMap and RecvCommMap are used by the code generator 
to synthesize communication. Several alternative communication strategies 
can be directly supported. To implement pairwise, point-to-point communi- 
cation we synthesize separate loops to iterate over the domain (processor set) 
of each map. Note that these loops will enumerate exactly those processors 
with which processor myid must exchange data. The algorithm described in 
Section 5.2 is then used to determine whether data to be communicated is 
contiguous in memory, or if this is not statically provable, to emit a logical 
expression that checks this at run-time. In the latter case, we also synthe- 
size a loop to iterate over the range of SendCommMap (i.e., the data to 
be sent to p) and use it to copy the data to a buffer. When we check con- 
tiguity at runtime, we generate both in-place and buffered communication 
for a particular communication event. For the receiving end, data can be 
received in-place if overlap areas are used and the communicated section is 
contiguous. The receiver uses the same approach to determine whether to 
receive data in place or into a communication buffer. We currently use the 
MPI_Bsend and MPI_Recv primitives for blocking point-to-point communication;
MPI_Bsend guarantees sender-side buffering. This is a simple solution to
ensure that deadlock does not occur in cases where all processors have to 
send and receive data, but it may introduce excess buffer copy operations 
at the sender. We are currently extending our communication generation to 
use asynchronous message-passing primitives which can significantly reduce 
buffer copying overheads at the sender and receiver. 

It is straightforward to extend the above approach to exploit collective 
communication primitives, in cases where these will be more efficient than 
point-to-point communication. For example, using a broadcast operation sim- 
ply requires eliminating the processor loop and using a single call to broadcast 
the data set to all processors. The methods to determine if data is contiguous 
and to generate the buffer packing code both remain unchanged. Reductions 
require separate code generation steps to synthesize code to compute local 
partial sums within each processor. The data set for a reduction is simply 
the entire temporary variable (scalar or array) used to hold the local partial 
sums. Thereafter, the code generation for buffering and communication steps 
of a reduction are the same as above. 

In summary, code generation for message vectorization and coalescing re- 
lies on the integer set framework to compute the processor and data sets for 
communication, and uses these to synthesize code for explicit communication.
The algorithms are independent of the specific form of the reference, data 
layout, and computation partitioning involved, and fully support the gen- 
eral computation partitioning model in dHPF. Independent algorithms are 
used to determine whether buffering is required, and to minimize the costs 
of accessing buffered non-local data. These are described in the following
subsections. 

5.2 Recognizing In-Place Communication 

Whether compiling HPF for shared-memory or message-passing architec- 
tures, avoiding excess data copies can boost performance. Here we describe a 
technique developed to avoid data copies when generating code for a message 
passing communication model based on MPI. Common MPI implementations 
permit data to be sent or received “in-place” (avoiding an explicit data copy) 
when the address range of the data is contiguous. To increase the likelihood 
that communication can be performed in-place, we have developed a com- 
bined compile-time/run-time algorithm for recognizing contiguous data based 
on our capability of generating code from integer-sets. 

A communication set for an n-dimensional Fortran array represents contiguous
data if there is some dimension $k$, $1 \le k \le n$, such that, for the
high-order dimensions $1 \le i < k$, the set spans the full range of array
dimension $i$, along dimension $k$ the set has a contiguous index range, and in
the low-order dimensions $k + 1 \le j \le n$, the set contains a single index
value. Using our integer-set representation for communication sets, we can
express these individual conditions as Boolean predicates, requiring up to 
three predicates for each array dimension. The existence of a value k satisfy- 
ing the above properties can be directly expressed as a logical combination 
of these predicates, first eliminating those predicates that can be proven true 
or false at compile time. We construct and express this logical expression as 
an integer set. If the entire expression can be proven true or false at compile 
time, then we generate a single version of code for communicating contiguous 
or non-contiguous data respectively (i.e., with and without buffer packing). 
If the expression cannot be proven true or false, we synthesize code from the 
integer set to test the condition at runtime. In this case, we generate both 
versions of communication code (with and without buffer packing). 
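
The contiguity condition can be written out directly for the special case of rectangular sections. The sketch below is illustrative only: it assumes each dimension of the communication set is a non-empty (lo, hi, step) triple, whereas dHPF builds the predicates from general integer sets, and the function names are hypothetical.

```python
def spans_full(sec, bound):
    """The section covers the whole declared range of this dimension."""
    lo, hi, step = sec
    return (lo, hi) == bound and (step == 1 or lo == hi)


def is_range(sec):
    """The section is a unit-stride index range (a single index also counts)."""
    lo, hi, step = sec
    return step == 1 or lo == hi


def is_single(sec):
    """The section holds exactly one index value."""
    lo, hi, _ = sec
    return lo == hi


def is_contiguous(section, bounds):
    """Direct form of the condition: try every candidate dimension k."""
    n = len(section)
    return any(
        all(spans_full(section[d], bounds[d]) for d in range(k))
        and is_range(section[k])
        and all(is_single(section[d]) for d in range(k + 1, n))
        for k in range(n)
    )


BOUNDS = [(1, 64), (1, 64), (1, 64)]      # e.g. real f(64,64,64)
```

A section such as f(1:64, 3, 7) satisfies the condition with k = 1, while f(5:9, 3:4, 7) does not satisfy it for any k.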

Directly checking the logical condition constructed above requires checking
$O(n^2)$ terms (conjunctions of predicates), at compile-time or runtime. We
can in fact reduce this cost to $O(n)$ by using a single scan of the dimensions
(leftmost first) to find the first dimension $k$ which cannot be proved to span
the full range of the array dimension, and then checking the predicates for
dimensions $k, \ldots, n$. If these predicates cannot be proven at compile time, we can
synthesize code to repeat this scan and check at runtime, when it can be done
precisely. In practice, however, the number of array dimensions n is typically 
small, and many of the predicates are statically proven true or false. The cost 
of evaluating the logical condition at run-time will usually be much smaller 
than the cost of packing all but the smallest messages into a communica- 
tion buffer. Therefore, we take the simpler approach (described above) of 
constructing and evaluating the logical condition directly as a single set. 
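
For comparison, the single-scan alternative mentioned above can be sketched as follows. This is a runtime-style illustration under simplifying assumptions, not dHPF code: sections are non-empty unit-stride (lo, hi) ranges per dimension, and the helper that enumerates column-major offsets exists only to cross-check the scan on small arrays.

```python
from itertools import product


def is_contiguous_scan(section, bounds):
    """Single left-to-right scan over (lo, hi) unit-stride index ranges."""
    n = len(section)
    k = 0
    # Skip the leading dimensions that span their full declared range.
    while k < n and section[k] == bounds[k]:
        k += 1
    # Dimension k may be any range; every later dimension must be a single index.
    return all(lo == hi for lo, hi in section[k + 1:])


def offsets(section, bounds):
    """Column-major (Fortran order) offsets of all elements, for cross-checking."""
    extents = [ub - lb + 1 for lb, ub in bounds]
    result = []
    for index in product(*(range(lo, hi + 1) for lo, hi in section)):
        offset, stride = 0, 1
        for d, i in enumerate(index):
            offset += (i - bounds[d][0]) * stride
            stride *= extents[d]
        result.append(offset)
    return sorted(result)


def is_contiguous_bruteforce(section, bounds):
    """True when the element offsets form one gap-free address range."""
    offs = offsets(section, bounds)
    return offs == list(range(offs[0], offs[0] + len(offs)))
```

The scan touches each dimension at most once, whereas the brute-force check enumerates every element and is useful only as a test oracle.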
By combining compile-time and runtime decisions to identify contiguous
data, we obtain maximum efficiency when the data is provably contiguous 
but also minimize the likelihood that explicit buffer packing will be needed 
when this decision cannot be made until runtime. Furthermore, by basing the 
analysis directly on an explicit integer set representation of the data, we can 
apply it to arbitrary communication sets, independent of data layouts and 
communication patterns. 

5.3 Implementing Loop-Splitting for Reducing Communication 
Overhead 

As described at the outset of this section, loop-splitting (or iteration re- 
ordering) techniques can be used to ameliorate two types of communication 
overhead: the latency of communication, and the cost of referencing buffered 
non-local data. Both techniques involve splitting a loop to separate the it- 
erations that access only local data from those that may access non-local 
data. After splitting, we can overlap communication for the loop with the 
computation in the local iterations, which do not require the communicated 
data. We can also reduce the number of ownership guards executed before 
non-local references by eliminating the guards in the local iterations, and 
perhaps some in the non-local iterations as well. 

The only implementation of loop-splitting we know of is in Kali [32], 
where the authors used set equations to explain the optimization but used 
case-based analysis to derive the iteration sets (during compiler development) 
for a few special cases restricted to one-dimensional data distributions. This 
approach is only practical for a small number of special cases. We have ex- 
tended the equations in [32] to apply to an arbitrary number of non-local 
references with any regular data layouts and any CP in our CP model, using 
the sets and mappings described in previous sections. We first describe the 
loop-splitting analysis for communication overlap, because the loop transfor- 
mations in this case subsume the transformations required for splitting for 
buffer access. 

Loop-splitting in dHPF is applied to one perfectly nested sequence of loops at
a time, for loops that satisfy the conditions below. The restriction to perfectly 
nested loops is not essential, but slightly simplifies our implementation of 
the code generation. The analysis described here can be used unchanged for 
imperfect loop nests. We look for a maximal loop-nest that includes loops 
for which it is legal to reorder the loop iterations, and also loops which we 
can predict will not be reordered. It is legal to reorder iterations of a
loop if it has no loop-carried data dependences and it does not enclose any 
communication operations. We can predict that certain loops will not be 
reordered, as follows. Note that for a read reference, a subscript in a non- 
distributed array dimension will not cause the reference to be non-local. (In 
a write reference, however, such a subscript could induce communication to 
send the data back to all the owners.) Therefore, non-local index set splitting 

(a) if nlRWIters is empty:

    SEND data for non-local reads
    execute nlWOIters
    SEND data for non-local writes
    execute localIters
    RECV data for non-local reads
    execute nlROIters
    RECV data for non-local writes

(b) if nlRWIters is non-empty:

    SEND data for non-local reads
    execute nlWOIters
    execute localIters
    RECV data for non-local reads
    execute nlROIters ∪ nlRWIters
    SEND data for non-local writes
    RECV data for non-local writes

Fig. 5.2. Generated code for overlapping communication and computation.



will not reorder the iterations of a loop whose loop index variable is used 
to index only local references and non-distributed array dimensions in non-local
RHS references. Therefore, we can ignore any dependences that may be
carried on such loops, and these loops can be safely included in the loop nest. 

Since any write reference as well as any read reference may be non-local 
in dHPF, our goal is to separate the iterations of a loop nest into four sections:
those that access only local data (localIters), and those that only
read, only write, or read and write non-local data (nlROIters, nlWOIters 
and nlRWIters respectively). These sets are computed using a sequence of 
integer set equations, taking as input the basic sets and maps of Figure 3.1 
and the non-local data set, nlDataAccessed(m), computed as an intermedi- 
ate result for the communication sets as described in Section 5.1. The detailed 
equations are explained in [1]. The key step is computing the iterations that 
access non-local data for each given reference. The non-local data set for 
the reference describes the non-local data accessed by the processor myid. 
Composing this map with $RefMap^{-1}$ gives the iterations that access these
non-local elements. For example, consider the reference B(i-1,j) in the Jacobi
kernel of Figure 2.1. We compose
$RefMap^{-1} = \{[b_1, b_2] \to [l_1, l_2] : b_1 = l_1 - 1 \wedge b_2 = l_2\}$
with the non-local data set nlDataAccessed(m) shown
in Figure 5.1. This yields precisely the set of boundary iterations that access
non-local data due to that reference:
$\{[l_1, l_2] : 1 \le m_1 \le 2 \wedge l_1 = 20 m_1 + 1 \wedge \max(2, 20 m_2 + 1) \le l_2 \le \min(20 m_2 + 20, 59)\}$.
By taking
unions over non-local read and write references respectively, we directly get 
the set of iterations that read non-local data and the set of iterations that 
write non-local data. These sets may not be disjoint, but from these two 
sets and the CP, the four desired sets can be directly computed. The code 
for these individual loop sections is directly synthesized from the respective 
integer sets by using mmcodegen. 
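
One plausible way to derive the four loop sections from the two non-local iteration sets and the CP is sketched below. The detailed equations are given in [1]; the set differences and intersections used here are an assumption of this sketch, as are the function names and the choice of processor (1, 1) on a 3 x 3 grid with 20 x 20 blocks.

```python
def split_iterations(cp_iters, nl_read_iters, nl_write_iters):
    """Partition the iterations of one processor into the four loop sections."""
    nl_read = nl_read_iters & cp_iters
    nl_write = nl_write_iters & cp_iters
    return {
        "localIters": cp_iters - nl_read - nl_write,
        "nlROIters": nl_read - nl_write,
        "nlWOIters": nl_write - nl_read,
        "nlRWIters": nl_read & nl_write,
    }


def nl_iters_for_ref(cp_iters, owned_data, offset):
    """Iterations whose reference B(i+di, j+dj) touches data the processor lacks."""
    di, dj = offset
    return {(i, j) for (i, j) in cp_iters if (i + di, j + dj) not in owned_data}


# Processor m = (1, 1) of a 3 x 3 grid, BLOCK = 20: it owns rows/columns 21..40.
block = {(i, j) for i in range(21, 41) for j in range(21, 41)}
stencil = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # four read references
nl_read = set().union(*(nl_iters_for_ref(block, block, off) for off in stencil))
sections = split_iterations(block, nl_read, set())  # no non-local writes here
```

In this example the local section is the 18 x 18 interior of the block, and the remaining boundary iterations form the read-only non-local section.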

We schedule the communication and computation for this loop nest in 
the sequence shown in Figure 5.2. Both read and write communication la- 
tency would be overlapped with some computation if all non-local writes are 
performed first, then local iterations, and finally non-local reads. This is possible
when nlRWIters is empty, as shown in Figure 5.2(a). If nlRWIters
is non-empty, however, these iterations both read and write non-local data, 
and these must be placed after the RECV for non-local reads and before the 
SEND for non-local writes. Therefore, we can overlap either read or write 
communication with localIters, but not both. A simple heuristic could be
used to choose between the two alternatives, by comparing how early read 
data is produced and how late write data is consumed. For now, however, we 
simply overlap read communication with nlWOIters and localIters, and we
merge nlRWIters with nlROIters, as shown in Figure 5.2(b).
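
The choice between the two orderings of Figure 5.2 can be expressed as a small selection function. This is a sketch of the ordering only; the step tuples and the helper name are hypothetical, and no communication is performed.

```python
def schedule(local, nl_ro, nl_wo, nl_rw):
    """Order communication and loop sections as in Fig. 5.2."""
    if not nl_rw:
        # (a): both read and write latency overlap with some computation.
        return [
            ("SEND", "reads"), ("EXEC", nl_wo), ("SEND", "writes"),
            ("EXEC", local), ("RECV", "reads"), ("EXEC", nl_ro),
            ("RECV", "writes"),
        ]
    # (b): iterations that read and write non-local data run after the
    # RECV for reads and before the SEND for writes.
    return [
        ("SEND", "reads"), ("EXEC", nl_wo), ("EXEC", local),
        ("RECV", "reads"), ("EXEC", nl_ro | nl_rw),
        ("SEND", "writes"), ("RECV", "writes"),
    ]


def position(steps, step):
    """Index of a step in the schedule, used to compare orderings."""
    return steps.index(step)
```

In case (b) only the read communication is overlapped with computation; the write communication is issued after all loop sections have run.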

The goal of the second optimization based on loop-splitting is to minimize 
the number of ownership guards executed before non-local references. A ref- 
erence would not need a guard if it accesses only local or only non-local data 
in all iterations of the enclosing loops. Therefore, in a loop with r non-local 
references, the ownership guards could be completely eliminated by splitting 
the index space into $2^r - 1$ non-local subsets, plus the local section. To avoid
this exponential behavior, we can simply split the loop into one local and one 
non-local subset, and use guards only in the non-local section if necessary. 
If splitting is being applied for communication overlap, that actually creates 
three non-local sections instead of one, and no further transformations are 
required. In any case, references in the local iterations do not need ownership 
guards. A reference in a non-local loop section also does not need such guards
if the set of iterations in that section is identical to the set of non-local
iterations due to that reference alone. 
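
The guard rule can be illustrated on a toy one-dimensional loop. The sketch below is an assumption-laden illustration: the iteration range, the two references, and the function names are hypothetical, and `needs_guard` applies the general criterion stated earlier (a guard is unnecessary when a reference is only local or only non-local throughout a section).

```python
def classes_by_signature(iters, nl_iters_by_ref):
    """Group iterations by which of the r references are non-local in them."""
    classes = {}
    for it in iters:
        signature = tuple(it in nl for nl in nl_iters_by_ref)
        classes.setdefault(signature, set()).add(it)
    return classes


def needs_guard(section_iters, ref_nl_iters):
    """A guard is needed only if the reference is local in some iterations of
    the section and non-local in others."""
    nonlocal_here = section_iters & ref_nl_iters
    return bool(nonlocal_here) and nonlocal_here != section_iters


iters = set(range(1, 11))            # hypothetical 1-D loop, i = 1..10
nl_a, nl_b = {1}, {10}               # e.g. A(i-1) and A(i+1) on a block 1..10
classes = classes_by_signature(iters, [nl_a, nl_b])
nonlocal_section = nl_a | nl_b       # the single non-local section {1, 10}
```

With r = 2 references there are at most $2^2 - 1 = 3$ non-local classes; merging them into the single section {1, 10} keeps the code small at the price of a guard on each reference in that section.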

Code generation for the loop transformations described above is inte- 
grated into the hierarchical code generation framework described in Section 4, 
as an alternative computation partitioning strategy. This is because code gen- 
eration from the local and non-local sets has the side-effect of enforcing the 
CPs assigned to the statements in the loop (including reducing the loop 
bounds and introducing guards if necessary). This follows because each of 
the four loop sections is a subset of the iterations assigned to processor myid 
by these CPs. Thus, the combination of the integer-set-based analysis and 
the hierarchical code generation framework made it quite simple to add even 
these relatively complex optimizations in the compiler. 



6. Control Flow Simplification 

6.1 Motivation 

Several of the strategies for partitioning computation can split a loop’s iter- 
ation space to avoid adding computation partitioning guards inside the loop. 
By splitting the iteration space, such guards can be added between sections 
of the loop instead of inside, which can dramatically reduce the dynamic ex- 
ecution frequency of guards. However, after a loop has been split, the smaller 

      parameter (N=64)
      real a(N), b(N), f(N,N,N)
CHPF$ processors p(4)
CHPF$ template t(N,N,N)
CHPF$ distribute t(*,*,block) onto p
CHPF$ align f(i,j,k) with t(i,j,k)
CHPF$ align a(k) with t(*,*,k)
CHPF$ align b(k) with t(*,*,k)

      do j=1,N
        do k=2,N
C         SEND f(1:N,j,k-1)                                  ! ON_HOME f(1:N,j,k-1)
C         RECV f(1:N,j,k-1)                                  ! ON_HOME f(1:N,j,k)
          do i=1,N
            f(i,j,k) = (f(i,j,k) - a(k) * f(i,j,k-1)) * b(k) ! ON_HOME f(i,j,k)
          enddo
        enddo
      enddo

Fig. 6.1. HPF source for a fragment from the Erlebacher benchmark, showing
the preliminary placement of SEND and RECV and the CPs assigned.

resulting iteration spaces provide a sharper context that may make some 
conditionals or loops nested inside redundant or unsatisfiable. We illustrate 
these effects with an example later in this section. 

If individual code generation steps attempted to exploit full contextual 
knowledge to avoid or eliminate guards and empty loops, they would have to 
be significantly more complex. Furthermore, implementing such an approach 
would also require rebuilding or incrementally updating analysis information 
after each code generation step. Instead, we use a simpler and less expensive
approach in which we make no effort to avoid control-flow complexity
that arises as a result of interactions between the different optimization steps,
and use a separate post-pass optimization to eliminate excess control
flow. This approach has two major advantages: existing code generation
algorithms can be simpler, and the post-pass control-flow simplification can 
use powerful algorithms that exploit global information about the program. 

The foundation for control flow simplification in dHPF is an algorithm for 
globally propagating symbolic constraints on the values of variables imposed 
by loops, conditional branches, assertions, and integer computations. Sev- 
eral previous systems have supported strategies for computing and exploiting 
range information about variables [8,9,20,25,44]. Two key differences that 
distinguish our work are that we handle more general logical combinations 
of constraints on variables (not just ranges) and we use these constraints to 
simplify control flow. In this section, we present an example that illustrates 
how a sequence of code generation steps results in superfluous loops and
conditionals and then provide a brief overview of our control-flow simplification
technique. Our algorithm is described and evaluated in detail in [34].

      do j = 1, 64
 1      if (pmyid1 >= 1)
 2        k = 16 * pmyid1 + 1
          ! --<< Iterations that access only local values >>--
 3        if (16 * pmyid1 <= k - 2)                        ! UNFEASIBLE
            COMPUTE f(1:64, j, k)
 4        if (16 * pmyid1 == k - 1 && pmyid1 >= 1)         ! TAUTOLOGY
            RECV f(1:64, j, 16*pmyid1)
          ! --<< Iterations that read non-local values >>--
 5        if (16 * pmyid1 >= k - 1)                        ! TAUTOLOGY
            COMPUTE f(1:64, j, k)
 6      do k = 16 * pmyid1 + 2, 16 * pmyid1 + 16
 7        if (16 * pmyid1 == k - 17 .and. pmyid1 <= 2)     ! UNFEASIBLE
            SEND f(1:64, j, k - 1)
          ! --<< Iterations that access only local values >>--
 8        if (16 * pmyid1 <= k - 2)                        ! TAUTOLOGY
            COMPUTE f(1:64, j, k)
 9        if (16 * pmyid1 == k - 1 && pmyid1 >= 1)         ! UNFEASIBLE
            RECV f(1:64, j, 16*pmyid1)
          ! --<< Iterations that read non-local values >>--
10        if (16 * pmyid1 >= k - 1)                        ! UNFEASIBLE
            COMPUTE f(1:64, j, k)
        enddo
        if (pmyid1 <= 2)
          k = 16 * pmyid1 + 17
11        if (16 * pmyid1 == k - 17 .and. pmyid1 <= 2)     ! TAUTOLOGY
            SEND f(1:64, j, k - 1)
      enddo

Fig. 6.2. Skeletal SPMD code for Fig. 6.1 with partitioned computation.

To illustrate some of the principal sources of excess control-flow in dHPF, 
Fig. 6.1 shows source code for a loop from the Erlebacher benchmark, in- 
cluding the CPs and the initial placement of communication chosen by the 
compiler. The assignment statement is given the CP ON_HOME f(i,j,k),
and communication for the non-local reference f(i,j,k-1) is placed inside
the k loop because of a loop-carried flow dependence for array f. Fig. 6.2 
shows the skeletal SPMD code after CP code generation (the entire i loop 
has been shown as “COMPUTE f(1:64,j,k)” to simplify the figure). Examination
of the code in Fig. 6.2 shows that many of the guard expressions are
infeasible or tautological.

      do j = 1, 64
        if (pmyid1 >= 1) then
          k = 16 * pmyid1 + 1
          RECV f(1:64, j, k-1)
          ! --<< Iterations that read non-local values >>--
          COMPUTE f(1:64, j, k)
        endif ! (pmyid1 >= 1)
        do k = 16 * pmyid1 + 2, min(16 * pmyid1 + 16, 63)
          ! --<< Iterations that access only local values >>--
          COMPUTE f(1:64, j, k)
        enddo ! k
        if (pmyid1 <= 2) then
          k = 16 * pmyid1 + 17
          SEND f(1:64, j, k - 1)
        endif ! (pmyid1 <= 2)
      enddo ! j

Fig. 6.3. Skeletal SPMD code for Fig. 6.2 after simplification.

The causes of excess control flow in this example are as follows: 

1. The guards on lines 8 and 10 are initially generated by splitting the i 
loop into local and non-local iterations. At this time it is not known what 
sections of the k loop will finally be generated. We cannot apply non-local 
index set splitting to the k loop because of the loop-carried dependence. 
However, when mmcodegen is applied to reduce loop bounds for the k 
loop, the loop is tiled into 3 sections (k = 16*pmyid1 + 1,
16*pmyid1 + 2 <= k <= 16*pmyid1 + 16, and k = 16*pmyid1 + 17), so that additional
guards are not required within the loop to enforce the different CPs of the 
three inner statements. The guards on lines 8 and 10 now get duplicated 
on lines 3 and 5, along with their enclosed COMPUTE blocks. Now, in the 
refined context of these k loop sections, the guards on lines 3 and 10 are 
always false, and those on lines 5 and 8 are always true. 

2. When the k loop is split as described above, the SEND and RECV place- 
holders are duplicated (as shown) during CP code generation because of 
the CPs assigned to the communication placeholders. The CPs assigned 
to placeholders, shown in Figure 6.1, simply specify that the owner of 
f(l:N,j,k) must receive data (since it will execute the i loop), and 
the owner of f(l:N,j,k-l) must send data (since it owns the data be- 
ing read). These CPs are imprecise in that the SEND and RECV should 
actually execute only in the boundary iterations. Precise CPs for com- 
munication are difficult to compute and express in any general manner 
since communication patterns can be quite complex. Instead, we rely on
the communication generation phase to insert guards to ensure that only 
required communication occurs. Those guards are the ones shown on lines 
4, 7, 9 and 11. Although the loop context created when partitioning the 
k loop makes these guards infeasible or redundant, the communication 
code is generated without precise knowledge of this refined loop context. 
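
The annotations on the guards of Fig. 6.2 can be confirmed by enumerating every context in which each guard executes. The sketch below is a brute-force check, not the symbolic analysis used in dHPF; it assumes four processors (pmyid1 = 0..3), the loop bounds shown in Fig. 6.2, and hypothetical section names for the three parts of the k loop.

```python
def contexts():
    """Yield (section, pmyid1, k) for every point where a guard can execute."""
    for p in range(4):
        if p >= 1:
            yield "first", p, 16 * p + 1          # lines 1-2 of Fig. 6.2
        for k in range(16 * p + 2, 16 * p + 17):  # line 6: the middle k loop
            yield "middle", p, k
        if p <= 2:
            yield "last", p, 16 * p + 17          # final boundary iteration


# Guard line number -> (enclosing section, guard condition on pmyid1 and k).
GUARDS = {
    3: ("first", lambda p, k: 16 * p <= k - 2),
    4: ("first", lambda p, k: 16 * p == k - 1 and p >= 1),
    5: ("first", lambda p, k: 16 * p >= k - 1),
    7: ("middle", lambda p, k: 16 * p == k - 17 and p <= 2),
    8: ("middle", lambda p, k: 16 * p <= k - 2),
    9: ("middle", lambda p, k: 16 * p == k - 1 and p >= 1),
    10: ("middle", lambda p, k: 16 * p >= k - 1),
    11: ("last", lambda p, k: 16 * p == k - 17 and p <= 2),
}


def classify(line):
    """Label a guard by the truth values it takes over all of its contexts."""
    section, cond = GUARDS[line]
    values = {cond(p, k) for s, p, k in contexts() if s == section}
    if values == {True}:
        return "TAUTOLOGY"
    if values == {False}:
        return "UNFEASIBLE"
    return "NEEDED"
```

Every guard in the figure comes out as either a tautology or unfeasible, so none of them needs to survive simplification.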

It is important to note that the excess guards that arise in the code gener- 
ation steps described above are not due to poor code generation algorithms. 
In the first case, they result from a sophisticated loop transformation that 
reduces the dynamic execution frequency of guards. In the second case, avoiding
redundant guards when instantiating communication would
require full knowledge of the sharper context created by earlier code generation
steps. 

Another source of excess control flow not illustrated by the example above 
is the insertion of ownership guards for non-local references. As explained in 
Section 5.3, we create one to three sections of non-local iterations, and insert 
ownership guards in each of the non-local sections. However, if the iteration 
set of any non-local section is non-convex, mmcodegen automatically tiles 
the section into convex regions, potentially making some or all of the guards 
in that section redundant. In each case, it is sensible to generate these guards 
oblivious of their context and let a later control-flow simplification pass elim- 
inate any that became unnecessary. This enables our guard insertion algo- 
rithms to remain relatively simple without hurting performance. In the next 
section we describe our global algorithm for eliminating superfluous control 
flow. 



6.2 Overview of Algorithm 

The algorithm for simplifying control flow is based on the property that each 
conditional branch node (i.e., IF or DO loop) in a program guarantees certain 
constraints on the values of variables for the statements control dependent 
on the branch node. The goal of our algorithm is to collect and propagate 
these constraints globally through the code and use this information to sim- 
plify the control flow. Our algorithm combines three key program analysis 
technologies, namely, the control dependence graph (CDG) [13], global value 
numbering based on thinned gated single assignment form [21], and simplifi- 
cation of integer constraints specified as Presburger formulae [27]. The former 
two enable us to derive an efficient, non-iterative algorithm for propagating 
constraints along control dependences, while the latter enables simplification 
of logical constraints and code generation from the simplified constraints. The 
information we collect is closely related to the concept of “iterated control 
dependence” path conditions described by Tu and Padua [44]. 

Rather than computing constraints on variables directly, we compute constraints 
on value numbers representing unique values of variables. This allows 
us to avoid invalidating constraints after redefinitions and SSA merge points 
since each of these points simply yields a new value number for the variable. 
Constraints on an old value number may be irrelevant but are never incorrect. 
Constraints on a value number V at a given statement S in the program 
are logical combinations of equalities and inequalities that hold true for V at 
S. We use integer sets of rank 0 to represent constraints; this enables us to 
exploit the Omega library’s capabilities for symbolic simplification and code 
generation. 

Briefly, our constraint simplification algorithm operates in two passes as 
follows. The first pass collects constraints at conditional branch nodes in a 
single reverse-post-order traversal of the CDG. This order ensures that all control 
dependence (CD) predecessors of a node are visited before the node itself, 
except for predecessors along back edges which form cycles in the CDG. At 
each point in the traversal, we collect the constraints for a node as the inter- 
section of the constraints enforced locally at the node with the disjunction 
of the constraints that hold along paths from each of its control dependence 
predecessors. Our single pass algorithm for collecting constraints computes a 
conservative approximation in that it ignores constraints along back edges. ^ 
Despite using a non-iterative strategy for collecting constraints, we are still 
often able to extract useful constraints for loop-variant iterative values, particularly 
auxiliary induction variables recognized when computing value numbers. 
This includes relatively complex auxiliary induction variables that do 
not have a closed form but are defined by a loop-invariant iterative function 
(e.g., i = i/2). 

The second pass makes a reverse preorder traversal of the CDG (visiting 
CD dependents before ancestors), using the computed constraints to simplify 
a procedure’s control flow. This bottom-up order is a bookkeeping convenience: 
it ensures that the code transformations we apply at a node will not 
eliminate any conditionals that we have yet to visit. If the outgoing constraints 
for a loop or conditional branch are unsatisfiable, its code is eliminated. If 
the outgoing constraints at a conditional branch are implied by its incoming 
constraints, the test is eliminated. Otherwise, the test may still be simpli- 
fied given the known information in the incoming constraints. In this case, 
the simplified constraints are used to regenerate a simpler guard using the 
MMCODEGEN operation. 
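
The three outcomes described above (delete the code, delete only the test, or keep a possibly simplified test) can be illustrated with a small Python sketch. This is a deliberately reduced stand-in and not the dHPF implementation: constraints are plain integer intervals on a single value rather than Presburger formulae handled by the Omega library, every node is assumed to have exactly one control-dependence predecessor, and collection and simplification are merged into one top-down walk. The names Guard and classify are hypothetical.

```python
from dataclasses import dataclass, field

INF = float("inf")


@dataclass
class Guard:
    """A loop or IF node that enforces lo <= v <= hi on one value number v."""
    name: str
    lo: float = -INF
    hi: float = INF
    children: list = field(default_factory=list)


def classify(node, in_lo=-INF, in_hi=INF, result=None):
    """Label each node using the bounds known to hold on entry (incoming)."""
    result = {} if result is None else result
    # Outgoing constraints = incoming constraints AND the local test.
    out_lo, out_hi = max(in_lo, node.lo), min(in_hi, node.hi)
    if out_lo > out_hi:
        result[node.name] = "infeasible"  # unsatisfiable: remove node and body
        return result                     # nothing nested below can execute
    if node.lo <= in_lo and in_hi <= node.hi:
        result[node.name] = "tautology"   # implied by context: remove test only
    else:
        result[node.name] = "keep"
    for child in node.children:
        classify(child, out_lo, out_hi, result)
    return result


# A loop "do i = 1, 6" enclosing three guards on i.
loop = Guard("do i = 1, 6", 1, 6, [
    Guard("if (i .le. 6)", hi=6),
    Guard("if (i .ge. 7)", lo=7),
    Guard("if (i .ge. 3)", lo=3),
])
labels = classify(loop)
```

In this toy setting the first inner guard is implied by the loop bounds, the second can never hold, and the third still carries information and is kept.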

The algorithm also takes into account assertions specifying “known” con- 
straints about program variables inserted into the code by the programmer 
or previous phases of the compiler. These assertions provide a mechanism for 
communicating information that a later compiler pass may be unable to infer 
directly from the code. We refer the reader to [34] for the details of the overall 
algorithm, including the handling of iterative constructs and assertions. 

^ It is safe to ignore these constraints because our use of value numbers ensures 
that constraints along back edges will never invalidate existing constraints, 
rather they would just add more information. 



6.3 Evaluation and Discussion 

Figure 6.3 shows the simplified code for the Erlebacher intermediate code 
shown in Figure 6.2. In Figure 6.3, we see that all infeasible and tautologi- 
cal guards have been eliminated, the latter being replaced by their enclosed 
code. We also evaluated the algorithm for 3 benchmarks (Tomcatv from the 
Spec92 benchmark suite, Erlebacher, and Jacobi) [34]. The algorithm elimi- 
nated between 31% and 81% of guards introduced by the compiler for these 
programs, yielding between 1% and 15% improvement in execution time on 
an IBM SP-2. These improvements are achieved over and above the many 
aggressive optimizations in dHPF and scalar optimizations performed by the 
SP-2 node compiler. Overall, the control-flow simplification is able to com- 
pensate for the lack of complete context information in the earlier code gener- 
ation phases (and for the overly simple CPs for communication statements), 
permitting the earlier phases to be simpler without impacting performance. 

An interesting outcome we observed in our experiments is that the gen- 
eral purpose control-flow simplification algorithm (in combination with our 
CP code generation algorithm and loop splitting) provides some or all of 
the benefits of much more specialized optimizations such as vector-message 
pipelining [43] and overlap areas [14]. In particular, the pipelined shift communication 
pattern for Erlebacher shown in Fig. 6.3 is exactly what vector-message 
pipelining aims to produce, but the latter is a narrow optimization 
and complex to implement, as explained in [34]. In dHPF, however, it results 
naturally from loop-splitting, bounds reduction, and control-flow simplifica- 
tion. As a second example, overlap areas are specifically designed to simplify 
referencing non-local data in shift communication patterns, but a compre- 
hensive implementation of overlap areas requires interprocedural analysis. 
Even without overlap areas, we obtained equally simple code for shift com- 
munication patterns in both Tomcatv and Erlebacher, and nearly as simple 
code in Jacobi. This happened because non-local loop-splitting and control- 
flow simplification together eliminated all or most of the ownership guards 
on non-local references, so that it was equally simple to access data out of 
separate non-local buffers as out of overlap areas. In fact, the code for Tom- 
catv without overlap areas slightly outperformed the code with overlap areas 
(by about 3%) because overlap areas required an extra unpacking opera- 
tion [34]. Overall, these results and the evaluation described above indicate 
that the control-flow simplification algorithm can be a useful general-purpose 
optimization for parallelizing compilers. 



7. Conclusions 

The motivation for the development of the Rice dHPF compiler is the need 
for HPF compiler technology that approximates or exceeds hand-coded per- 
formance across a broad spectrum of programs. Meeting our goal of offering 
consistently high performance will require a large collection of optimizations 
that are uniformly applicable to a wide variety of programs. 

In this chapter, we described several static code generation techniques 
that enable aggressive and robust optimizations in an HPF compiler, and 
yet greatly simplify the implementation of many of these optimizations. We 
described a computation partitioning model, a framework for analysis and 
optimization, and a code generation strategy that are more general than 
those used previously by HPF compilers. The key conclusions we draw from 
this work are as follows. First, a uniform and comprehensive code genera- 
tion framework can support a very general computation partitioning model, 
and permit harnessing powerful code-generation algorithms that extract the 
full performance of partitionings enabled by the model. Second, the use 
of abstract set equations as the medium for expressing optimizations has 
greatly simplified the construction of even complex optimizations, and there- 
fore makes it practical to incorporate a comprehensive collection of opti- 
mizations in a compiler. In addition, the high-level nature of these equations 
makes the optimizations very broadly applicable (including any computation 
partitionings in the general CP model), increasing their overall impact. 
Third, a global control-flow simplification algorithm can substantially reduce 
excess control-flow in the generated code, permitting other code generation 
algorithms to be significantly simpler. Finally, the synergy between some of 
these general optimizations provides some or all of the benefits of much more 
specialized optimizations such as vector-message pipelining and overlap areas. 

The base technology (the Omega library) used to implement the integer 
set framework is experimental, but it has proved invaluable for prototyping 
optimizations based on the framework. Further experience from using dHPF 
on a wide variety of programs will be required to judge whether the technology 
is efficient enough to be practical for commercial implementations. If this 
approach proves too costly, it would be important to develop more efficient 
but less precise set representations so that compilers could still obtain the 
power and simplicity of the integer set formulations. 

In this chapter, we have focused on the problem of compiling “regu- 
lar” data parallel codes. Many applications have features that cause them 
to fall outside this class, and in such cases different compilation strategies 
and run-time support are more appropriate. Our hierarchical framework for 
code generation enables different code generation strategies to be applied to 
a scope, or in some circumstances even to a single statement. This capability 
provides the flexibility needed to integrate alternate methods such as inspector/executor 
or run-time resolution to cope with programs that are not 
statically analyzable. 

Although the analysis and optimization described here focused on compiling 
for a message-passing target machine, most if not all of this technology 
is directly applicable to uniform and distributed shared memory (i.e., SMP 
and DSM) systems. Some of the optimizations built on the program analysis 
framework are aimed at avoiding communication by maximizing data access 
locality, which is essential for DSM systems as well. Furthermore, for shared- 
memory systems built from commodity microprocessors, managing locality 
by managing the memory hierarchy is essential for performance. The analysis 
and code generation capabilities described here are guided primarily by array 
layouts and provide the right leverage for exploiting locality and optimizing 
for the memory hierarchy on this emerging class of systems. 

(a) Two iteration spaces. 

    $I_1 = \{[i,j] : 1 \le i \le 10 \wedge 1 \le j \le 5 \wedge c \ge 1\}$ 
    $I_2 = \{[i,j] : 1 \le i \le 6 \wedge 1 \le j \le 5 \wedge c \ge 4\}$ 
    $known = \{c \ge 1\}$ 

(b) Code template from mmcodegen. 

    do i = 1, 10 
      if (c .ge. 4 .and. i .le. 6) 
        do j = 1, 5 
          s1(i, j) 
          s2(i, j) 
      if (i .ge. 7 .and. c .ge. 4) 
        do j = 1, 5 
          s1(i, j) 
      if (c .le. 3) 
        do j = 1, 5 
          s1(i, j) 
    enddo ! i 

Fig. 7.1. Constructing a code template from iteration spaces. 

Appendix A. MMCODEGEN 

Here we briefly introduce Kelly, Pugh & Rosser’s algorithm for “multiple-mappings 
code generation” [26], which we refer to as MMCODEGEN. This 
algorithm is available as part of the Omega library, and serves as a cornerstone 
for a variety of code generation tasks in the dHPF compiler. The inputs to 
MMCODEGEN are as follows: 

MMCODEGEN($I_1 \ldots I_v$, known, effort): 

$I_1 \ldots I_v$ : Iteration spaces for v statements. 
known : A set of rank 0, giving constraints on global variables in $I_1 \ldots I_v$ 
    that will be externally enforced. 
effort : Integer specifying that conditionals are to be removed from the 
    innermost effort + 1 loops. 

From these inputs, MMCODEGEN synthesizes a code template that enumerates 
the tuples in $I_1 \ldots I_v$ in lexicographic order, where the same tuple 
in different sets is ordered as $(i \in I_j) \prec (i \in I_k)$ for $j < k$. The constraints in 
known are assumed true and will not be enforced within the code template. 
The innermost effort + 1 loops in the code sequence will not contain conditionals. 
Figure 7.1 shows two iteration spaces and the corresponding code 
template generated with effort = 0, i.e., one level of guard lifting. In the 
code template, s1 and s2 are placeholders representing the first and second 
statements, respectively. The digit suffix in the placeholder name is 
the index of the iteration space it represents. 
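
The ordering rule can be checked against the template of Figure 7.1 with a short Python sketch. It does not call the Omega library; it simply transcribes the template by hand and compares its execution trace with a direct enumeration of the two iteration spaces. The set definitions are our reading of Figure 7.1(a) (I_1: 1 ≤ i ≤ 10, 1 ≤ j ≤ 5, c ≥ 1; I_2: 1 ≤ i ≤ 6, 1 ≤ j ≤ 5, c ≥ 4), and all function names are hypothetical.

```python
def in_i1(i, j, c):
    """Membership in I_1 as read from Figure 7.1(a)."""
    return 1 <= i <= 10 and 1 <= j <= 5 and c >= 1


def in_i2(i, j, c):
    """Membership in I_2 as read from Figure 7.1(a)."""
    return 1 <= i <= 6 and 1 <= j <= 5 and c >= 4


def reference_order(c):
    """Enumerate tuples lexicographically; ties go to the lower-numbered set."""
    spaces = (("s1", in_i1), ("s2", in_i2))
    return [(name, i, j)
            for i in range(1, 11)
            for j in range(1, 6)
            for name, member in spaces
            if member(i, j, c)]


def code_template(c):
    """Hand transcription of the code template in Figure 7.1(b)."""
    trace = []
    for i in range(1, 11):
        if c >= 4 and i <= 6:
            for j in range(1, 6):
                trace.append(("s1", i, j))
                trace.append(("s2", i, j))
        if i >= 7 and c >= 4:
            for j in range(1, 6):
                trace.append(("s1", i, j))
        if c <= 3:
            for j in range(1, 6):
                trace.append(("s1", i, j))
    return trace


# known = {c >= 1}, so the template only has to be right for such c.
agree = all(code_template(c) == reference_order(c) for c in range(1, 8))
```

Note that no guard appears inside the j loops, which is what effort = 0 asks for, and that c ≥ 1 is never tested because it is part of known.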

When MMCODEGEN is applied to a set of rank zero, it degenerates to syn- 
thesizing an IF statement that tests the constraints of the set. This variant 
arises (for example) for generating guards to enforce CPs for statements out- 
side loops, for control-flow simplification, and for code generation for testing 
at runtime whether communication can be performed in place. 

Summary. In data-parallel languages such as High Performance Fortran and Fortran D, 
arrays are mapped to processors through a two-step process involving 
ment followed by distribution. A compiler that generates code for each processor has 
to compute the sequence of local memory addresses accessed by each processor and 
the sequence of sends and receives for a given processor to access non-local data. 
In this chapter, we present a novel approach to the address sequence generation 
problem based on integer lattices. When the alignment stride is one, the mapping 
is called a one-level mapping. In the case of one-level mapping, the set of elements 
referenced can be generated by integer linear combinations of basis vectors. Using 
the basis vectors we derive a loop nest that enumerates the addresses, which are 
points in the lattice generated by the basis vectors. The basis determination and lat- 
tice enumeration algorithms are linear time algorithms. For the two-level mapping 
(non-unit alignment stride) problem, we present a fast novel solution that incurs 
zero memory wastage and little overhead, and relies on two applications of the so- 
lution of the one-level mapping problem followed by a fix-up phase. Experimental 
results demonstrate that our solutions to the address generation problem are signif- 
icantly faster than other solutions to this problem. In addition, we present a brief 
overview of our work on related problems such as communication generation, basis 
vector derivation, code generation for complex subscripts and array redistribution. 



1. Introduction 

Distributed memory multiprocessors are attractive for high performance com- 
puting in that they offer potentially high levels of flexibility, scalability and 
performance. However, programming these machines to realize their promised 
performance — which requires a full orchestration of the execution through 
careful partitioning of computation and data, and placement of message 
passing — remains a difficult task. The extreme difficulty of writing correct 
and efficient programs is a major obstacle to the widespread use of parallel 
high-performance computing. The main objective behind efforts such as 
High Performance Fortran (HPF) [13,20], Fortran D [10], and Vienna For- 
tran (which grew out of the earlier SUPERB effort) [4] is to raise the level 
of programming on distributed memory machines while retaining the object 
code efficiency derived for example from message passing. 

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 597-645, 2001. 

These languages include directives (such as align and distribute) that describe 
how data is distributed among the processors in a distributed-memory 
multiprocessor. For example, arrays in HPF are mapped to processors in two 
steps: in the first step, arrays are aligned to an abstract discrete Cartesian 
grid called a template; the template is then distributed onto the processors. 
The effect of this two-level mapping onto p processors is to create p disjoint 
pieces of the array, with each processor being able to address only data items 
in its locally allocated piece. Thus, an HPF compiler must generate code for 
each processor (called node code) that accesses only the locally owned pieces 
directly, and inserts communication for non-local accesses. 
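
As a small illustration of the two-step mapping, the Python sketch below computes which processor owns each array element. It assumes the affine alignment of A(i) to template cell T(ai + b) and the block-cyclic CYCLIC(k) distribution that are introduced in Section 2; the parameter values (a = 2, b = 1, k = 4, p = 3, twelve elements) and the function names are hypothetical choices made only for this example.

```python
def template_cell(i, a, b):
    """Step 1 (alignment): array element A(i) sits on template cell a*i + b."""
    return a * i + b


def owner(i, a, b, k, p):
    """Step 2 (distribution): blocks of k template cells are dealt out
    round-robin to p processors, so the owner is (cell mod pk) div k."""
    return (template_cell(i, a, b) % (p * k)) // k


a, b, k, p = 2, 1, 4, 3          # hypothetical mapping parameters
n = 12                           # hypothetical array size
owners = [owner(i, a, b, k, p) for i in range(n)]

# The p locally allocated pieces: disjoint, and together they cover the array.
pieces = {m: [i for i in range(n) if owners[i] == m] for m in range(p)}
```

With a non-unit alignment stride only every second template cell holds an array element, which is why each processor ends up with two elements per block of four cells here.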

In order to generate node code for each processor, we need to know the 
sequence of local memory addresses accessed by each processor and the se- 
quence of sends and receives for a given processor to access non-local data. 
A regular access pattern in terms of the global data structure can appear 
to be irregular within each locally allocated piece. For example, an array 
section $A(\ell : h : s)$ exhibits a regular access sequence of stride s; but with an 
HPF-style data mapping, the access sequence can become irregular. In this 
chapter we present efficient algorithms for generating local memory access 
patterns for the various processors given the alignment of arrays to a tem- 
plate and the distribution of the template onto the processors. For the case 
where the arrays are aligned identically to the template (also called one-level 
mapping), our solution [39] is based on viewing the access sequence as an 
integer lattice and involves the derivation of a suitable set of basis vectors for 
the lattice. Given the lattice basis, we enumerate the lattice by using loop 
nests; this allows us to generate efficient code that incurs negligible runtime 
overhead in determining the access pattern. Chatterjee et al. [5] presented an 
$O(k \log k + \log \min(s, pk))$ algorithm (where k is the block size; see Section 2 
for the definition) for this problem; ours is an $O(k + \log \min(s, pk))$ algorithm. 
Recently, Kennedy et al. [16] have also presented an $O(k + \log \min(s, pk))$ 
algorithm. Note that all these algorithms require computing $\gcd(s, pk)$, 
which is the reason for the $\log \min(s, pk)$ term in the complexity. Experiments 
demonstrate that our algorithm is 2 to 9 times faster than the algorithm of 
Kennedy et al. and 13 to 60 times faster than the algorithm of Chatterjee 
et al. Independently, Wang et al. conclude based on extensive experiments 
that “The LSU algorithm consistently outperforms the RIACS and Rice algorithm 
...” [45]. Our solution to the address generation problem for alignment 
followed by distribution (i.e., two-level mapping) uses two applications of our 
solution to the one-level mapping problem followed by an efficient and novel 
fix-up phase. This second phase is up to 10 times faster than other current 
solutions that do not waste local memory. 

This chapter is organized as follows. In Section 2, we present the problem 
setting and discuss related work. Sections 3 through 7 discuss one-level mapping in 
detail, while Sections 8 through 10 address two-level mapping. Section 3 outlines 
our approach using lattices and presents key mathematical results which are 
used later in the chapter. In Section 4, we present our linear algorithm for 
determining basis vectors, and contrast it with the algorithm of Kennedy et al. 
In Section 5 we show how to determine address sequences by lattice enumeration 
using loop nests: we use the lattice basis vectors derived 
in Section 4 to generate a loop nest that determines the address sequence. 
Section 6 discusses optimizations applied to the loop enumeration strategy 
presented in Section 5, specifically the GO-LEFT and GO-RIGHT schemes. Section 7 
demonstrates the efficacy of our approach using experimental results 
comparing our solution to those of Chatterjee et al. and Kennedy et al. In 
Section 8 we introduce the two-level mapping problem in detail; we present several 
new solutions to the two-level mapping problem in Section 9 and provide 
experimental results for this case in Section 10. In addition to these problems, 
our research group has addressed several additional problems in code 
generation and optimization such as communication generation, code generation 
for complex subscripts, runtime data structures, and runtime array 
redistribution; a brief outline of these is presented in Section 11. Section 12 
concludes with a summary. 

(a): Layout of an array distributed CYCLIC(k) onto p processors. 
(b): Local layout of array shown in Fig. 1.1(a) in Processor m. 

Fig. 1.1. Global and local addresses for data mappings 



2. Background and Related Work 

We consider an array A identically aligned to the template T; this means that 
if A(i) is aligned with the template cell T(ai + b), then the alignment stride 
a is 1 and the alignment offset b is 0. Further, let template T be distributed 
in a block-cyclic fashion with a block size of k across p processors; this is also 
known as a CYCLIC(k) distribution [13]. If k = 1 the distribution is called 
CYCLIC, and if $k = \lceil N/p \rceil$ (where N is the size of the template) the distribution 
is called a BLOCK distribution. We assume that arrays have a lower limit of 
zero, and processors and local addresses are numbered from zero onward. 
This mapping of the elements of A to the processor memories is shown in 
Figure 1.1(a). Though the elements of A are stored in the processor memories 
in a linear fashion as shown in Figure 1.1(b), we adopt a two-dimensional view 
of the storage allocated for the array as shown in Figure 1.1(a). We view the 
global addresses as being organized in terms of courses, each course consisting 
of pk elements. In other words, we assign a block of k cells of the template 
to each of the p processors and then wrap around and assign the rest of the 
cells in a similar fashion. In the two-dimensional view we adopt, the first 
dimension denotes the course number (starting from zero), and the second 
dimension denotes the offset from the beginning of the course. We refer to the 
two-dimensional view of an array A as $A_{2D}$, and the element A(i) has a 2-D 
address of the form $(x, y) = (i \,\mathrm{div}\, pk,\ i \bmod pk)$ in this space. Similarly, the 
local address of an element A(i) (denoted using $A_{loc}$) mapped to a processor 
m is $k \cdot (i \,\mathrm{div}\, pk) + i \bmod k$. 

Table 2.1. Symbols used in this chapter. 

Symbol                  Meaning 
A                       a distributed array 
T                       the template to which array A is aligned 
a                       stride of alignment of A to the template T 
b                       offset of alignment of A to the template T 
k                       block size of distribution of the template 
p                       number of processors to which the template is distributed 
$\ell$                  lower bound of regular section of A 
h                       upper bound of regular section of A 
s                       stride of regular section of A 
$A(\ell : h : s)$       a regular section of array A (array section) 
$A_{2D}$                two-dimensional view of an array A 
$A_{loc}$               local portion of array A allocated on a processor 
m                       processor number ($0 \le m \le p - 1$) 
$\Gamma$                array section lattice 
$\Gamma_m$              part of $\Gamma$ incident on processor m 

An array section in HPF is of the form $A(\ell : h : s)$, where s is the access 
stride, and $\ell$ and h are the lower and upper bounds, respectively. Table 2.1 
summarizes the notation. Given an array statement with HPF-style data 
mappings, it is our aim to generate the address sequence for the different 
processors. 

Consider the case of an array aligned identically to a template that is 
distributed CYCLIC(4) onto 3 processors, which is accessed with a stride of 7 
(p = 3, k = 4 and s = 7). Figure 2.1 shows the allocation of the array elements 
along with the corresponding global addresses. The array elements accessed 
are marked and the corresponding local addresses are shown in Figure 2.1. 
While the global access stride is constant (7 in this case), the local access 
sequence does not have a constant stride on any processor. For example, the 
local addresses of elements accessed in processor 1 are 3, 8, 14, 25, 31, .... The 
address generation problem is to efficiently enumerate this sequence. 

Fig. 2.1. Layout of array A for p = 3 and k = 4 along with the global and 
local addresses of the accessed elements in A(0 : 95 : 7) (s = 7); superscripts 
denote local addresses. 
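
The example can be reproduced with a few lines of Python. The sketch below is a brute-force scan over the whole section, which is exactly the cost the algorithms in this chapter are designed to avoid; it is meant only to make the mapping formulas concrete. It assumes a one-level mapping (a = 1, b = 0), and the function names are hypothetical.

```python
def owner(i, k, p):
    """Processor holding global element i under CYCLIC(k) on p processors."""
    return (i % (p * k)) // k


def local_address(i, k, p):
    """Local address k*(i div pk) + i mod k of global element i on its owner."""
    return k * (i // (p * k)) + i % k


def local_sequence(lo, hi, s, k, p, m):
    """Local addresses touched on processor m by the section A(lo:hi:s),
    found by scanning every accessed global index."""
    return [local_address(i, k, p)
            for i in range(lo, hi + 1, s)
            if owner(i, k, p) == m]


# A(0:95:7) with p = 3 and k = 4, as seen from processor 1.
seq = local_sequence(0, 95, 7, k=4, p=3, m=1)
gaps = [b - a for a, b in zip(seq, seq[1:])]
```

The list of gaps makes the irregularity visible: the global stride is a constant 7, while the local memory gaps on processor 1 vary from one access to the next.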



2.1 Related Work on One-Level Mapping 

Several papers have addressed this code generation problem. Koelbel [19] de- 
rived techniques for compile-time address and communication generation for 
BLOCK and CYCLIC distribution for non-unit stride accesses containing a single 
loop index variable. MacDonald et al. [21] provided a simple solution for 
the restricted case where the block sizes and the number of processors are powers 
of two. Chatterjee et al. [5] derived a purely runtime technique that identifies 
a repeating access pattern, which is characterized as a finite-state machine. 
Their $O(k \log k + \log \min(s, pk))$ algorithm involves a solution of k linear Diophantine 
equations to determine the pattern of addresses accessed, followed 
by sorting these addresses to derive the accesses in linear order. Gupta et 
al. [12] derived the virtual-block and the virtual-cyclic schemes. The virtual-block 
(cyclic) scheme views the global array as a union of several cyclically 
(block) distributed arrays. The virtual-cyclic scheme does not preserve the 
access order in the case of DO loops; this is not a problem for parallel array 
assignments. For large block sizes, this approach may suffer from cache 
misses [5]. They present a strategy for choosing a virtualization scheme for 
each array involved in an array statement, based on indexing overheads and not 
on cache effects. In an exhaustive study of this problem, Stichnoth [33,34] 
presented a framework (that bears similarities to the approach of Gupta et 
al. [12]) to enumerate local addresses and generate communication sets. 
Banerjee et al. [2] discuss code generation for regular and irregular applications. 
Coelho et al. [6] present a survey of approaches to compiling HPF. 
Ancourt et al. [1] use a linear algebra framework to generate code for fully 
parallel loops in HPF; their technique does not work for DO loops. Midkiff [22] 
presented a technique that uses a linear algebra approach to enumerate local 
accesses on a processor; this technique is similar to the virtual block approach 
presented by Gupta et al. [12]. Van Reeuwijk et al. [32] presented a technique 
which requires the solution of linear Diophantine equations. Benkner [3] pre- 
sented a solution for code generation for block-cyclic distributions in Vienna 
Fortran 90. 
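
Several of the techniques surveyed above reduce address generation to solving linear Diophantine equations. The Python sketch below shows the underlying idea for a one-level mapping with lower bound 0: for each of the k offsets in a processor's block, solve s·j ≡ offset (mod pk) for the first access index j, then sort the resulting local addresses. It is our own illustration in the spirit of that approach, not a transcription of the algorithm of any cited paper, and the function names are hypothetical.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y


def first_period(s, k, p, m):
    """Local addresses hit on processor m during one period of the access
    pattern with stride s, together with the local period length."""
    pk = p * k
    g, x, _ = extended_gcd(s, pk)        # s*x == g (mod pk)
    hits = []
    for offset in range(m * k, (m + 1) * k):
        if offset % g:
            continue                     # s*j == offset (mod pk) is unsolvable
        j = (x * (offset // g)) % (pk // g)   # smallest non-negative solution
        i = s * j                        # global index of that access
        hits.append(k * (i // pk) + i % k)
    # The pattern repeats every lcm(s, pk) global elements, i.e. every
    # s/g courses, which is k*s/g local addresses.
    return sorted(hits), k * s // g


pattern, period = first_period(s=7, k=4, p=3, m=1)
```

For p = 3, k = 4 and s = 7 this gives the first four local addresses on processor 1 and a local period of 28, so the next access falls at 3 + 28 = 31, consistent with the sequence quoted for Figure 2.1.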

Kennedy, Nedeljkovic, and Sethi [16] derived an $O(k + \log \min(s, pk))$ algorithm 
for address generation. The improvement over Chatterjee et al.’s 
algorithm comes from avoiding the sorting step at the expense of solving an 
additional set of $k - 1$ linear Diophantine equations. In Sections 3 through 7, 
we present our improved solution to the one-level mapping problem. Unlike 
Kennedy et al. [16] who solve $k - 1$ linear Diophantine equations, our approach 
requires the solution of just two linear Diophantine equations. In addition, 
we present an efficient loop-nest based approach to enumerate the array ad- 
dresses in order to derive the address pattern. Wijshoff [46] describes access 
sequences for periodic skewing schemes (used in providing efficient data or- 
ganization in parallel machines) using lattices and derived closed forms for 
the lattices. He does not discuss code generation. Wang et al. [45] discuss 
experiments with several address generation solutions and conclude that the 
strategy described by us in this chapter (and in [39]) is the best strategy 
overall. 

Other Work on One-Level Mapping from Our Group: In [36-38,41], 
we presented closed form expressions for basis vectors for several cases. Using 
the closed form expressions for the basis vectors, we derived a non-unimodular 
linear transformation; the matrix associated with this transformation has 
a determinant equal to the inverse of the access stride. In an experiment 
with a large set of values for the parameters p (the number of processors), 
k (block size) and s (the access stride), we derived the best pair of basis 
vectors using the closed form expressions for 82% of the problem instances. 
In later sections, we show that basis vector generation dominates address 
generation. Recently, we [25] have derived a runtime solution for the basis 
vector generation problem whose complexity is $O(\log \min(s, pk))$, which is 
simply the complexity of computing the required gcd. In contrast, all the 
other algorithms known to date for basis generation have a complexity of 
$O(k + \log \min(s, pk))$ or worse. 



2.2 Related Work on Two-Level Mapping 

A few methods have been proposed to address the code generation problem 
for two-level mapping. The solution by Chatterjee et al. [5] involves two 
applications of the one-level algorithm, where the input strides are a for the 
first and $a \cdot s$ for the second. They build two finite state machines which 
will generate the access sequence for the allocated and the accessed elements. 
The next step involves using the finite state machines for the template space 
to rebuild a new finite state machine for the local address space. This fix- 
up step involves computing expensive integer divide and mod operations to 
determine the addresses of accessed elements in the local memory space. An 
added drawback of their technique when compared to our technique is the 
fact that their one-level pattern tables contain local memory gaps and not 
actual addresses. However their execution preserves lexicographic ordering 
and they do not incur any memory wastage. 

Ancourt et al. [1] presented a solution for the two-level mapping problem; 
it involves a change of basis which leads to compression of holes but still 
incurs some memory wastage. The node code generated by this framework 
has complicated loops and incurs high execution overhead; their execution 
order does not preserve lexicographic ordering. 

Kaushik [15] uses processor virtualization to handle two-level mapping 
with block-cyclic distributions. His method involves the generation of addresses 
both with and without hole compression. 
First, the amount of memory that must be allocated is determined using a 
regular section characterization for block and cyclic distributions. Then, this 
regular section characterization is extended to the virtual processor approach 
for handling block-cyclic distributions. His technique does not ensure lexicographic 
execution in the case of the virtual cyclic approach. In addition, memory 
wastage that grows with the amount of allocated storage is incurred. 



3. A Lattice Based Approach for Address Generation 

In the next few sections, we present a novel technique based on integer lat- 
tices for the address generation problem presented in the previous section. 
We first show that the accessed array elements of an array section belong 
to an integer lattice. We then provide a linear time algorithm to obtain the 
basis vectors of the integer lattice. We then go on to use these basis vectors 
to generate a loop nest that determines the access sequence. The problem of 
basis determination forms the core of code generation for HPF array state- 
ments; thus, a fast solution for this problem improves the performance of 
several facets of an HPF compiler. We also provide a few optimizations
of our basis determination algorithm.



3.1 Assumptions 

In the next several sections, we present our approach to address generation 
for an alignment stride, a = 1. For a > 1, we use an approach similar to the 
one developed in [5], which involves two applications of our algorithm; this 
approach is discussed in Sections 8 through 10.

We assume that A is identically aligned to the template T. As is evident
from Figure 1.1(a), we assign a block of k cells of the template to each of
the p processors and then wrap around and assign the rest of the cells in a
similar fashion. As mentioned earlier, we treat the global address space as
a two-dimensional space, and every element $A(i)$ of an array has an address
of the form $(x, y) = (i \;\mathrm{div}\; pk,\; i \bmod pk)$ in this space. Here x is the course
number to which this element belongs and y is the offset of the element in
that course. We refer to the two-dimensional view of an array A as $A_{2D}$; this
notation is used throughout this chapter. Similarly, $A_{loc}$ refers to the locally
allocated piece of array A on a processor.
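The two-dimensional view can be made concrete with a small Python sketch. The function names below are hypothetical (they do not appear in the chapter), and the sketch assumes the identity alignment and block-cyclic layout with block size k described above.

```python
def to_2d(i, p, k):
    """Return (course, offset) = (i div pk, i mod pk) for global index i."""
    return divmod(i, p * k)


def owner(i, p, k):
    """Processor owning global index i: each course gives k cells to each processor."""
    return (i % (p * k)) // k


def local_address(i, p, k):
    """Address of element i in its owner's local array (k cells per course)."""
    course, offset = to_2d(i, p, k)
    return course * k + offset % k
```

For p = 3 and k = 4, element 55 has the two-dimensional address (4, 7), lives on processor 1, and sits at local address 19, which agrees with the example worked out later in Section 5.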

3.2 Lattices 

We use the following definitions of a lattice and its basis [11]. 

Definition 31. A set of points $x_1, x_2, \ldots, x_k$ in $\mathbb{R}^n$ is said to be independent
if these points do not belong to a linear subspace of $\mathbb{R}^n$ of dimension less than $k$.

For $k = n$, this is equivalent to the condition that the determinant
of the matrix $X$ whose columns are $x_i$ ($1 \le i \le n$) is non-zero, i.e.,
$\det([x_1\; x_2\; \cdots\; x_n]) \ne 0$. Now we state the following definition from the theory
of lattices [11]; see [11] for a proof.

Definition 32. Let $b_1, b_2, \ldots, b_n$ be $n$ independent points. Then the set $\Lambda$
of points $q$ such that

$$q = u_1 b_1 + u_2 b_2 + \cdots + u_n b_n$$

(where $u_1, \ldots, u_n$ are integers) is called a lattice. The set of vectors $b_1, \ldots, b_n$
is called a basis of $\Lambda$. The matrix $B = [b_1\; b_2\; \cdots\; b_n]$ is called a basis matrix.

Definition 33. Let $\Lambda$ be a discrete subgroup of $\mathbb{R}^n$ which is not contained in
an $(n-1)$-dimensional linear subspace of $\mathbb{R}^n$. Then $\Lambda$ is a lattice.

We refer to the set of global addresses (elements) over all processors accessed
by $A(\ell : h : s)$ with distribution parameters $(p, k)$ as $\Gamma = \langle (\ell : h : s), p, k \rangle$. We
refer to the set of local addresses accessed by processor $m$ in executing its
portion of $A(\ell : h : s)$ with distribution parameters $(p, k)$ as $\Gamma_m$. By construction,
$\Gamma$ is a discrete subgroup of $\mathbb{R}^2$ and, in general, is not contained in a
1-dimensional linear subspace of $\mathbb{R}^2$; if a single vector can be used to generate
the elements accessed, our algorithm handles this as a special case. Thus,
without loss of generality, $\Gamma = \langle (\ell : h : s), p, k \rangle$ is a lattice. Similarly, $\Gamma_m$ is
also a lattice.

In order to find the sequence of local addresses accessed on processor m, 
one needs to: 
1. find a set of basis vectors of the lattice $\Gamma_m$; and

2. enumerate the points in $\Gamma_m$ using integer linear combinations of the basis
vectors.

Fig. 3.1. Explanation of basis determination (one region contains all elements
accessed before z; the other contains all elements accessed after z)

Our solution to Step 1 (presented in Section 4) uses the fact that a basis
for the lattice can be computed from a knowledge of some of the points in
the lattice. Our solution to Step 2 (presented in Section 5) uses the fact that
if (with a given origin) every point in the lattice can be generated as a
non-negative integer linear combination of a suitable pair of basis vectors, then
these points can be enumerated by a two-level loop (each level with a step
size of 1); in addition, this two-level nest can be derived by applying the
linear transformation $B^{-1}$, where $B$ is the basis matrix of the lattice.

Definition 34. A basis $B$ of the lattice $\Lambda$ is called an extremal basis if the
set of points $q$ that belong to $\Lambda$ can be written as

$$q = u_1 b_1 + u_2 b_2 + \cdots + u_n b_n$$

where $u_1, \ldots, u_n$ are non-negative integers.

This chapter presents an algorithm for determining an extremal basis of the
array section lattice $\langle (\ell : h : s), p, k \rangle$ and shows how to use the extremal basis
to generate the address sequence efficiently.



4. Determination of Basis Vectors 

In this section, we show how to derive a pair of extremal basis vectors for the
lattice $\Gamma_m$. In order to do that, we state a key result that allows us to find a
basis for the lattice given a set of points in the lattice.

Result 41. Let $b_1, b_2, \ldots, b_n$ be independent points of a lattice $\Lambda$ in $\mathbb{R}^n$.
Then $\Lambda$ has a basis $\{a_1, a_2, \ldots, a_n\}$ such that

$$b_i = \sum_{k=1}^{i} u_{ki}\, a_k \qquad (i = 1, \ldots, n)$$

where $u_{ii} > 0$ and $0 \le u_{ki} < u_{ii}$ ($k < i$; $i = 1, \ldots, n$). In addition, the set of
points $\{b_1, b_2, \ldots, b_n\}$ is a basis of the lattice $\Lambda$ if and only if $u_{ii} = 1$ for
$i = 1, \ldots, n$.

While this result allows us to decide if a given set of independent points form 
a basis of the lattice, it is not constructive. But for n = 2 , we derive the 
following theorem which allows us to construct a basis for the array section 
lattice on processor m. We use the vector o to refer to the origin of the lattice. 

Theorem 41. Let $b_1$ and $b_2$ be independent points of a lattice $\Lambda$ in $\mathbb{R}^2$ such
that the closed triangle formed by the vertices $o$, $b_1$ and $b_2$ contains no other
points of $\Lambda$. Then $\{b_1, b_2\}$ is a basis of $\Lambda$.

Proof. Let $\{a_1, a_2\}$ be a basis of $\Lambda$ of the form guaranteed by Result 41, so
that we can write

$$b_1 = u_{11} a_1$$

$$b_2 = u_{12} a_1 + u_{22} a_2$$

where $u_{11} > 0$, $u_{22} > 0$ and $0 \le u_{12} < u_{22}$. Based on the hypothesis, the side
of the triangle connecting vertices $o$ and $b_1$ does not contain other points of
$\Lambda$. Therefore, $u_{11} = 1$.

Let us assume that $u_{22} > 1$. If $u_{12} = 0$, the triangle formed by $o$, $b_1$ and
$b_2$ contains the point $a_2 \in \Lambda$; similarly, if $u_{12} \ge 1$, the triangle formed by
$o$, $b_1$ and $b_2$ contains the point $a_1 + a_2 \in \Lambda$. This contradicts our hypothesis.
Hence, $u_{22} = 1$. Since $u_{11} = u_{22} = 1$, it follows from Result 41 that the vectors
$b_1$ and $b_2$ form a basis of $\Lambda$.

Thus, in order to determine a basis of the array section lattice on processor
$m$, we need to find three points (one of which can be considered as the origin
without loss of generality) not on a straight line such that the triangle formed
by them contains no other points belonging to the lattice. Let $x_1$, $x_2$, and
$x_3$ be three consecutively accessed elements of the array section lattice on
processor $m$. If $x_1$, $x_2$, and $x_3$ are independent points (do not lie on a straight
line), then the vectors $x_2 - x_1$ and $x_3 - x_2$ form a basis for $\Gamma_m$. Recall that
we view the array layout as consisting of several courses on each processor
with each course consisting of $k$ elements; this allows us to refer to each of
the $k$ columns on a processor. For the array section $A(\ell : h : s)$, let $c_f$ be the
first column in which an element is accessed and let $c_l$ be the last column in
which an element is accessed. Let $z_f$ be some element accessed in column $c_f$
and let $z_l$ be some element accessed in column $c_l$ by a processor. Let $x^{prev}$
denote the element accessed immediately before $x$ in lexicographic order on
a processor, and let $x^{next}$ denote the element accessed immediately after $x$ in
linear order on a processor. We do not discuss the case where $z_f$ and $z_f^{prev}$
(or $z_f^{next}$) are in the same column. This case is handled separately and is
easily detected by our technique.

Fig. 4.1. No lattice point is a non-positive linear combination of basis vectors

Theorem 42. The set of points $\{z_f^{prev}, z_f, z_f^{next}\}$ generates a basis of $\Gamma_m$.

Proof. From Theorem 41, the set of points $\{z_f^{prev}, z_f, z_f^{next}\}$ generates a basis
of $\Gamma_m$ if there are no lattice points in the triangle enclosing them and if they
are independent. By construction, these are consecutive points in the lattice
$\Gamma_m$ and therefore, there are no lattice points in the triangle (if any) enclosing
them. Suppose these are not independent, i.e., they lie on a straight line.
This implies one of the following two cases:

Case (a): Column($z_f^{prev}$) < Column($z_f$) < Column($z_f^{next}$).

Case (b): Column($z_f^{next}$) < Column($z_f$) < Column($z_f^{prev}$).

Neither of these cases holds, since Column($z_f$) = $c_f$ is the first column on
processor $m$ in which any element is accessed. Therefore, the three points are
independent. Hence the result.

Similarly, the set of points $\{z_l^{prev}, z_l, z_l^{next}\}$ also generates a basis of $\Gamma_m$. We
use the set $\{z_f^{prev}, z_f, z_f^{next}\}$. We refer to the vector $z_f - z_f^{prev}$ as $l = (l_1, l_2)$
and the vector $z_f^{next} - z_f$ as $r = (r_1, r_2)$. Again by construction, $l_1 > 0$, $r_2 > 0$,
$l_2 < 0$ and $r_1 \ge 0$. This is illustrated in Figure 3.1.
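The construction of l and r can be checked by brute force on small cases. The Python sketch below is not the algorithm of this chapter: it simply lists the elements owned by processor m over three periods of the access pattern and reads l and r off the neighbours of an element in the first accessed column. The function name, the three-period window, and the parameter name ell (standing for the lower bound of the section) are assumptions made for illustration; the sketch also assumes that at least two columns are accessed on the processor.

```python
from math import gcd


def basis_by_enumeration(p, k, s, m, ell=0):
    """Return (l, r) as 2D vectors (course, offset) for processor m."""
    pk = p * k
    period = pk * s // gcd(s, pk)          # the pattern repeats after pks/d
    # Section elements owned by processor m over three periods, in access order.
    owned = [e for e in range(ell, ell + 3 * period, s)
             if m * k <= e % pk < (m + 1) * k]
    c_f = min(e % pk for e in owned)       # first accessed column
    # Pick the occurrence of column c_f in the middle period so that both
    # a predecessor and a successor exist in the list.
    idx = next(i for i, e in enumerate(owned)
               if e % pk == c_f and e >= ell + period)
    prev, z_f, nxt = (divmod(owned[idx + j], pk) for j in (-1, 0, 1))
    l_vec = (z_f[0] - prev[0], z_f[1] - prev[1])
    r_vec = (nxt[0] - z_f[0], nxt[1] - z_f[1])
    return l_vec, r_vec
```

For p = 3, k = 4, s = 11 on processor 1 this returns l = (1, -1) and r = (8, 3); in both test cases the determinant $l_1 r_2 - l_2 r_1$ equals s.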

4.1 Basis Determination Algorithm 

In order to obtain a basis for the lattice, we need to find three points
belonging to the lattice, not on a straight line, such that the triangle formed by
them contains no other lattice point. Chatterjee et al. [5] suggested a way to
locate lattice points by solving a linear Diophantine equation for each column
with accessed elements. For details on their derivation, refer to [5]. The smallest
solution of each of these solvable equations gives the smallest array element
accessed in the corresponding column on a processor. Using this, we show that
we can obtain a basis for an array section lattice which generates the smallest
element in each column that belongs to $\Gamma_m$ by solving only two linear
Diophantine equations.

Fig. 4.2. No lattice points in region spanned by vectors l and -r

Fig. 4.3. No lattice points in region spanned by vectors -l and r
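As an illustration of the per-column Diophantine step, the following Python sketch finds the smallest section element that falls in a given column of the two-dimensional view. It is a minimal sketch under stated assumptions: the function names are hypothetical, `column` is an offset in the range 0 to pk - 1, and the smallest non-negative solution j of $s \cdot j \equiv \mathit{column} - \ell \pmod{pk}$ is obtained from the extended Euclid coefficients in the standard way.

```python
def extended_euclid(a, b):
    """Return (d, x, y) with d = gcd(a, b) and a*x + b*y == d."""
    if b == 0:
        return a, 1, 0
    d, x, y = extended_euclid(b, a % b)
    return d, y, x - (a // b) * y


def smallest_in_column(p, k, s, ell, column):
    """Smallest element ell + j*s (j >= 0) whose offset i mod pk equals column.

    Returns None when the column is never accessed (equation not solvable).
    """
    pk = p * k
    d, x, _ = extended_euclid(s, pk)
    i = column - ell
    if i % d != 0:
        return None
    j = ((i // d) * x) % (pk // d)   # smallest non-negative solution
    return ell + s * j
```

With p = 3, k = 4, s = 11 and a lower bound of 0, the smallest elements in columns 4, 5, 6 and 7 come out as 88, 77, 66 and 55.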

The first two consecutive points accessed on a given column and the first
point accessed on the next solvable column form a triangle that contains no
other points. Again, let $c_f$ be the first column in which an element is accessed
on a processor. Let $z_f$ be the first element accessed in column $c_f$. Since the
access pattern on a given processor repeats after $pks/d$ elements, where
$d = \gcd(s, pk)$, the point accessed immediately after $z_f$ in column $c_f$ is
$z_f + pks/d$. Now if $c_s$ is the second column in which an element is accessed
on the processor, and $z_s$ is the first element accessed in it, then without loss
of generality $z_f$, $z_f + pks/d$ and $z_s$ form a basis for the array section lattice.
Hence the vectors $(s/d, 0)$ and $z_s - z_f$ form a pair of basis vectors of the array
section lattice.

Fig. 4.4. Starting points in the columns generated by the vectors $(s/d, 0)$ and
$z_s - z_f$.

The elements $z_f$ and $z_s$ can be obtained by solving the first two solvable
Diophantine equations in the algorithm (Lines 5 and 7) shown in Figure 4.5.
Figure 4.4 shows the basis vectors generated as explained above, whereas
Figure 4.5 gives an outline of how the new basis vectors could be used to access
the smallest array element accessed in each column for the case where $z_s$ lies
on a course above or below $z_f$. Our basis determination algorithm works as
follows. First we use the new basis to walk through the lattice to enumerate
all the points on the lattice before the pattern starts to repeat. Then we use
these points to locate the three independent points $z_f^{prev}$, $z_f$, and $z_f^{next}$ that form
a triangle that contains no other lattice point. Using these three points we
obtain the new basis of the lattice, which we use to walk through the lattice
in lexicographic order. Thus, we now need to solve only two Diophantine
equations to generate the basis of the array section lattice that enumerates
the points accessed in lexicographic order. This new basis determination
algorithm performs substantially better than that proposed by Kennedy et al. [16]
for large values of k.
Input: Layout parameters (p, k), regular section (ℓ : h : s), processor number m.
Output: start address, end address, length, basis vectors r = (r1, r2) and l = (l1, l2)
for m.

Method:

 1  (d, x, y) ← EXTENDED-EUCLID(s, pk); length ← 2; start ← h + 1
 2  last ← ℓ + pks/d - 1; first ← ℓ; bmin ← last + 1; amax ← first - 1
 3  i ← d⌈(km - ℓ)/d⌉; i_end ← km - ℓ + k - 1
 4  if i > i_end then return ⊥, ⊥, 0, ⊥, ⊥, ⊥, ⊥      /* No element */
 5  amin ← bmax ← z_f ← ℓ + (s/d)(ix + pk⌈-ix/pk⌉); i ← i + d
 6  if i > i_end then return z_f, z_f, 1, ⊥, ⊥, ⊥, ⊥   /* One element */
 7  loc ← z_s ← ℓ + (s/d)(ix + pk⌈-ix/pk⌉); length ← 2
 8  if z_s < z_f then
 9     amin ← amax ← z_s; vec2 ← z_s - z_f; vec1 ← pks/d + vec2
10  else
11     bmin ← bmax ← z_s; vec1 ← z_s - z_f; vec2 ← vec1 - pks/d
12  endif
13  if vec1 < vec2 then
14     loc ← loc + vec1; i ← i + d
15     while i < i_end do
16        if loc > last then
17           loc ← loc - pks/d
18        endif
19        if loc < z_f then      /* loc is accessed before z_f */
20           amax ← max(amax, loc); amin ← min(amin, loc)
21        else                   /* loc is accessed after z_f */
22           bmax ← max(bmax, loc); bmin ← min(bmin, loc)
23        endif
24        loc ← loc + vec1; i ← i + d; length ← length + 1
25     enddo
26  else
27     loc ← loc + vec2; i ← i + d
28     while i < i_end do
29        if loc < first then
30           loc ← loc + pks/d
31        endif
32        if loc < z_f then      /* loc is accessed before z_f */
33           amax ← max(amax, loc); amin ← min(amin, loc)
34        else                   /* loc is accessed after z_f */
35           bmax ← max(bmax, loc); bmin ← min(bmin, loc)
36        endif
37        loc ← loc + vec2; i ← i + d; length ← length + 1
38     enddo
39  endif
40  (start, end, l1, l2, r1, r2) ← COMPUTE-VECTORS(z_f, p, k, s, d, amin, amax,
       bmin, bmax)
41  return start, end, length - 1, l1, l2, r1, r2

Fig. 4.5. Algorithm for determining basis vectors
Input: z_f, p, k, s, d, amin, amax, bmin, bmax.

Output: The start memory location, end memory location and the basis
vectors r = (r1, r2) and l = (l1, l2) for processor m.

Method:

 1  if amin = z_f and amax = -1 then              /* above is empty */
 2     l1 ← ⌊z_f/pk⌋ + s/d - ⌊bmax/pk⌋
 3     l2 ← z_f mod k - bmax mod k
 4     r1 ← ⌊bmin/pk⌋ - ⌊z_f/pk⌋
 5     r2 ← bmin mod k - z_f mod k
 6  else if bmin = pks/d and bmax = z_f then      /* below is empty */
 7     l1 ← ⌊z_f/pk⌋ - ⌊amax/pk⌋
 8     l2 ← z_f mod k - amax mod k
 9     r1 ← ⌊amin/pk⌋ + s/d - ⌊z_f/pk⌋
10     r2 ← amin mod k - z_f mod k
11  else                                          /* neither above nor below is empty */
12     l1 ← ⌊z_f/pk⌋ - ⌊amax/pk⌋
13     l2 ← z_f mod k - amax mod k
14     r1 ← ⌊bmin/pk⌋ - ⌊z_f/pk⌋
15     r2 ← bmin mod k - z_f mod k
16  endif
17  start ← amin
18  end ← bmax
19  return start, end, l1, l2, r1, r2

Fig. 4.6. COMPUTE-VECTORS procedure for basis vectors determination algorithm



4.2 Extremal Basis Vectors 

In this section, we show that the basis vectors generated by our algorithm in 
Figure 4.5 form an extremal set of basis vectors. 

Theorem 43. The lattice $\Gamma_m$ (the projection of the array section lattice on
processor $m$) contains only those points which are non-negative integer linear
combinations of the basis vectors l and r.

Proof. Let $z$ be the lexicographically first (starting) point of the lattice $\Gamma_m$.
As r and l are the basis vectors of the lattice, any point $q$ belonging to the
lattice can be written as

$$q = z + v_1 l + v_2 r$$

where $r = (r_1, r_2)$ and $l = (l_1, l_2)$. Also, $l_1 > 0$, $l_2 < 0$, $r_1 \ge 0$ and $r_2 > 0$ by
construction. Suppose $q \in \Gamma_m$ and that either one or both of $v_1$ and $v_2$ are
negative integers. There are two cases to consider:

Case (1): ($v_1 < 0$ and $v_2 < 0$), which implies $v_1 l_1 + v_2 r_1 < 0$.

As both $v_1$ and $v_2$ are negative, it is clear from Figure 4.1 that $q$ lies in the
region above the start element $z$. This contradicts our earlier assumption
that $z$ is the start element on processor $m$. Hence, $v_1 < 0$ and $v_2 < 0$
cannot be true.

Case (2): ($v_1 < 0$ or $v_2 < 0$)

Without loss of generality we assume that the start element $z$ on a
processor lies in the first non-empty column.

Let $v_1 \ge 0$ and $v_2 < 0$. As shown in Figure 4.2, $q$ lies in the region to the
left of $z$ since $v_1 l_2 + v_2 r_2 < 0$. This contradicts our assumption that $q \in \Gamma_m$.
Hence $v_1 \ge 0$ and $v_2 < 0$ cannot be true.

If $v_1 < 0$ and $v_2 \ge 0$, then the next element accessed after the origin is
either $z + r$ or $z + r - l$. If the next accessed element of $\Gamma_m$ is $z + r - l$,
then this point should be located on a course above $z$ or on a course below
$z$. If this element is located on a course above $z$, it would not be a point
in the lattice $\Gamma_m$. If this element is located on a course below $z$, then this
element is lexicographically closer to $z$ than the point $z + r$, which is impossible
(due to the construction of r). By the above arguments (as can be seen in
Figure 4.3), we have shown that the next element accessed after $z$ can only
be $z + r$. A repeated application of the above argument rules out the presence
of a lattice point in the shaded regions in Figure 4.3. If $z$ is not in the first
accessed column on processor $m$, similar reasoning can be used for the vector
l. Hence, the result.



4.3 Improvements to the Algorithm for s < k

If $s < k$, it is sufficient to find only $s + 1$ lattice points instead of $k$ lattice points
(as in [16]) in order to derive the basis vectors. Our implementation uses this
idea, which is explained next. Figure 4.7 shows that the access pattern repeats
after $s$ columns. A close inspection of the algorithm for determining the basis
of the lattice for the case where $s < k$ reveals that the basis vector r will
always be $(0, s)$. We also notice that since we access at least one element on
every row, the first component of the basis vector l, i.e., $l_1$, must always be
1. Hence, if $s < k$, all we need to solve for is the second component of l, i.e.,
$l_2$. This results in a reduced number of comparisons for the $s < k$ case; in
addition, there is no need to compute bmin, since it is not needed for the
computation of l.

Consider the case where $\ell = 36$, $p = 4$, $k = 16$ and $s = 5$, which is
shown in Figure 4.7; we illustrate this case for processor number 2. Running
through lines 1-38 of the algorithm in Figure 4.5 for this case, we get $z_f = 96$,
amin = 36, amax = 46, bmin = 101 and bmax = 301 for processor 2. Since
the pattern of the smallest element accessed in a column repeats after every
$s$ columns, it is sufficient to enumerate the elements accessed in the last $s$
columns to obtain amax and bmax. amin can be obtained by subtracting a
suitable multiple of $s$ from the smallest element in the last $s$ columns lying
above $z_f$. By making these changes to the algorithm in Figure 4.5 for the
case where $s < k$, we can obtain the required input for the COMPUTE-VECTORS
procedure.
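The values quoted in this example can be reproduced with a brute-force sketch. The code below is not the improved algorithm: it enumerates one full period of the section, which costs far more than solving two or three Diophantine equations, and is meant only as a reference against which a faster implementation could be checked. The function name and the parameter name ell are hypothetical, and the sketch assumes that the processor owns at least one accessed element.

```python
from math import gcd


def column_extremes(p, k, s, m, ell):
    """Return (z_f, amin, amax, bmin, bmax) for processor m by enumeration.

    amin/amax are the smallest/largest owned elements accessed before z_f,
    bmin/bmax the smallest/largest owned elements accessed after z_f, all
    within one period of the access pattern. Empty regions give None.
    """
    pk = p * k
    period = pk * s // gcd(s, pk)
    owned = [e for e in range(ell, ell + period, s)
             if m * k <= e % pk < (m + 1) * k]
    c_f = min(e % pk for e in owned)                 # first accessed column
    z_f = next(e for e in owned if e % pk == c_f)
    above = [e for e in owned if e < z_f]
    below = [e for e in owned if e > z_f]
    amin, amax = (min(above), max(above)) if above else (None, None)
    bmin, bmax = (min(below), max(below)) if below else (None, None)
    return z_f, amin, amax, bmin, bmax
```

Calling `column_extremes(4, 16, 5, 2, 36)` gives (96, 36, 46, 101, 301), matching the values stated above for processor 2.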
Fig. 4.7. Addresses of elements of array A accessed on processor 2 for the
case p = 4, k = 16, s = 5 and offset $\ell = 36$

The improved algorithm to determine the basis vectors for each processor is
as follows.

1. Solve the Diophantine equation corresponding to the first solvable column
to obtain $z_f$.

2. Solve the Diophantine equations corresponding to the last and last but
one solvable columns to obtain $z_l$ and $z_{l-1}$ respectively. Use these two
solutions in a similar way as shown in Figure 4.5 to generate a pair of
basis vectors for the lattice, vec1 and vec2, in terms of offsets.

3. Using the above basis vectors, enumerate all the points in the last $s$
columns starting from the last solvable column. By comparing these
elements to $z_f$, we get the smallest and the largest element accessed in these
$s$ columns that lie below $z_f$, namely bmin and bmax. Similarly, we find
the smallest and the largest elements that lie above $z_f$, namely amin
and amax.

4. If the region above $z_f$ is not empty, then amax = amax and amin =
amin $- i \cdot s$ (where $i \cdot s$ is a suitable multiple of $s$). If $\ell$ lies on the processor,
then amin = $\ell$.

5. If both $\ell$ and $\ell - s$ lie on the processor, then bmax = $\ell + pks/d - s$. If the region
below $z_f$ is not empty, then bmin = bmin $- i \cdot s$ (where $i \cdot s$ is a suitable
multiple of $s$) and bmax = bmax.

6. Generate the lattice basis using the COMPUTE-VECTORS procedure.

4.4 Complexity 

Line 1 of the algorithm in Figure 4.5 is the extended Euclid step, which
requires $O(\log \min(s, pk))$ time. Lines 2 through 40 require $O(\min(s, k))$ time.
Thus, the basis generation part of our algorithm is $O(\log \min(s, pk) +
\min(s, k))$; the basis generation portion of the algorithm in Kennedy et al. [16]
is $O(k + \log \min(s, pk))$. We note that the address enumeration part of both
the algorithms is $O(k)$. Experiments have shown that the basis determination
part dominates the total time in practice; see Section 7 for a discussion.
Thus, our algorithm is superior to that of Kennedy et al. [16].

5. Address Sequence Generation by Lattice 
Enumeration 

As mentioned earlier, we will treat the global address space as a two-dimensional
space, and each element $A(i)$ of an array has an address of the form
$(x, y) = (i \;\mathrm{div}\; pk,\; i \bmod pk)$ in this space. We refer to the two-dimensional
view of an array A as $A_{2D}$. The sequence of the array elements accessed
(course by course) in a processor can be obtained by strip mining the loop
corresponding to $A(\ell : h : s)$ with a strip length of $pk$ and appropriately
restricting the inner loop limits. In the following analysis we assume that $\ell = 0$
and $h = N - 1$. At the end of this section we will show how the code
generated for $A(0 : N-1 : s)$ can be used to generate the code for $A(\ell : h : s)$.
The code for the HPF array section $A(0 : N-1 : s)$ that iterates over all the
points in the two-dimensional space shown in Figure 1.1(a) could be written
as follows:

   DO i = 0, ⌈N/pk⌉ - 1
      DO j = 0, pk - 1
         A2D(i, j) = ···
      ENDDO
   ENDDO

We apply a non-unimodular loop transformation [23, 24] to the above loop
nest to obtain the points of the lattice. Since the access pattern repeats after
the first $s/d$ courses, we limit the outer loop in the above loop nest to iterate
over the first $s/d$ courses only. In this case the global address of the first
element allocated to the processor memory is $mk$, where $m$ is the processor
number. So in order to obtain the sequence of local addresses on processor
$m$, we need to apply the loop transformation to the following modified code:

   DO i = 0, s/d - 1
      DO j = mk, mk + k - 1
         A2D[i, j] = ···
      ENDDO
   ENDDO

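The basis matrix for the lattice as derived in the last section is

$$B = \begin{bmatrix} l_1 & r_1 \\ l_2 & r_2 \end{bmatrix}.$$

Hence the transformation matrix $T$ is of the form

$$T = B^{-1} = \frac{1}{\Delta} \begin{bmatrix} r_2 & -r_1 \\ -l_2 & l_1 \end{bmatrix}$$

where $\Delta = l_1 r_2 - l_2 r_1 = s$ (since $l_1 > 0$, $r_1 \ge 0$, $l_2 < 0$, and $r_2 > 0$). The loop
bounds can be written as follows:

$$\begin{bmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \le \begin{bmatrix} 0 \\ \frac{s}{d} - 1 \\ -mk \\ mk + k - 1 \end{bmatrix}$$

where

$$\begin{bmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} l_1 & r_1 \\ l_2 & r_2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}$$

and $u$ and $v$ are integers. Therefore,

$$\begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} l_1 u + r_1 v \\ l_2 u + r_2 v \end{bmatrix}.$$

We now use Fourier-Motzkin elimination [7] on the following system of
inequalities to solve for integral $u$ and $v$:

$$-l_1 u - r_1 v \le 0$$
$$l_1 u + r_1 v \le \frac{s}{d} - 1$$
$$-l_2 u - r_2 v \le -mk$$
$$l_2 u + r_2 v \le mk + k - 1$$

If $r_1 > 0$ we have the following inequalities for $u$ and $v$:

$$\left\lceil \frac{(-mk - k + 1)\, r_1}{s} \right\rceil \le u \le \left\lfloor \frac{(\frac{s}{d} - 1)\, r_2 - mk\, r_1}{s} \right\rfloor$$

$$\max\left( \left\lceil \frac{mk - u l_2}{r_2} \right\rceil, \left\lceil \frac{-u l_1}{r_1} \right\rceil \right) \le v \le \min\left( \left\lfloor \frac{mk + k - 1 - u l_2}{r_2} \right\rfloor, \left\lfloor \frac{\frac{s}{d} - 1 - u l_1}{r_1} \right\rfloor \right)$$

The node code for processor $m$ if $r_1 > 0$ is:

   DO u = ⌈(-mk - k + 1) r1 / s⌉, ⌊((s/d - 1) r2 - mk r1) / s⌋
      DO v = max(⌈(mk - u l2)/r2⌉, ⌈-u l1/r1⌉), min(⌊(mk + k - 1 - u l2)/r2⌋, ⌊(s/d - 1 - u l1)/r1⌋)
         A2D[l1 u + r1 v, l2 u + r2 v] = ···
      ENDDO
   ENDDO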
If $r_1 = 0$ we have the following inequalities for $u$ and $v$:

$$0 \le u \le \left\lfloor \frac{r_2}{s} \left( \frac{s}{d} - 1 \right) \right\rfloor$$

$$\left\lceil \frac{mk - u l_2}{r_2} \right\rceil \le v \le \left\lfloor \frac{mk + k - 1 - u l_2}{r_2} \right\rfloor$$

The node code for processor $m$ if $r_1 = 0$ is:

   DO u = 0, ⌊(r2/s)(s/d - 1)⌋
      DO v = ⌈(mk - u l2)/r2⌉, ⌊(mk + k - 1 - u l2)/r2⌋
         A2D[l1 u, l2 u + r2 v] = ···
      ENDDO
   ENDDO



Example 51. Code generated for the case where $\ell = 0$, $p = 3$, $k = 4$ and
$s = 11$ for processor 1.

The set of addresses generated by the algorithm in Figure 4.5 is {88, 77, 66,
55}. Also $z_f = 88$, amin = 55, amax = 77, bmin = 132 and bmax = 88.
Since the below section is empty, we execute the below-is-empty branch of the
algorithm in Figure 4.6. So our algorithm returns l = (1, -1) and r = (8, 3) as
the basis vectors. The access pattern is shown in Figure 5.1. The node code to
obtain the access pattern for processor 1 is:

   DO u = ⌈-56/11⌉, ⌊-2/11⌋
      DO v = max(⌈(4 + u)/3⌉, ⌈-u/8⌉), min(⌊(7 + u)/3⌋, ⌊(10 - u)/8⌋)
         A2D[u + 8v, -u + 3v] = ···
      ENDDO
   ENDDO

Next we show the iterations of the nested loop and the elements accessed;
the elements indeed are accessed in lexicographic order.

    u    v_lb   v_ub   accessed elements (2D)
   -5     1      0     (none)
   -4     1      1     (4,7)
   -3     1      1     (5,6)
   -2     1      1     (6,5)
   -1     1      1     (7,4)

Converting the global two-dimensional addresses of the accessed elements to
global addresses, we get the global access pattern (55, 66, 77, 88) on processor
1, which gives the local access pattern (19, 22, 25, 28) on processor 1.
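The node code of this section translates directly into a short program. The Python sketch below is an illustration under the same assumption as the analysis above (lower bound 0 for the section); the function names are hypothetical. It uses the bounds obtained from Fourier-Motzkin elimination, with integer ceiling and floor division in place of the rational bounds, and it skips the $r_1$ bounds on v when $r_1 = 0$.

```python
from math import gcd


def ceil_div(a, b):
    """Ceiling of a / b for integers (b may be negative, must be non-zero)."""
    return -((-a) // b)


def go_right(p, k, s, m, l_vec, r_vec):
    """Enumerate the 2D addresses (course, offset) accessed on processor m."""
    (l1, l2), (r1, r2) = l_vec, r_vec
    mk = m * k
    c = s // gcd(s, p * k) - 1                 # last course index, s/d - 1
    u_lo = ceil_div((-mk - k + 1) * r1, s)     # equals 0 when r1 == 0
    u_hi = (c * r2 - mk * r1) // s
    points = []
    for u in range(u_lo, u_hi + 1):
        v_lo = ceil_div(mk - u * l2, r2)
        v_hi = (mk + k - 1 - u * l2) // r2
        if r1 > 0:
            v_lo = max(v_lo, ceil_div(-u * l1, r1))
            v_hi = min(v_hi, (c - u * l1) // r1)
        points.extend((l1 * u + r1 * v, l2 * u + r2 * v)
                      for v in range(v_lo, v_hi + 1))
    return points


def to_global(points, p, k):
    return [i * p * k + j for i, j in points]


def to_local(points, k, m):
    return [i * k + (j - m * k) for i, j in points]
```

For the parameters of Example 51, `go_right(3, 4, 11, 1, (1, -1), (8, 3))` yields the four points of the table in the same order, and the two conversion helpers reproduce the global and local access patterns quoted above.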
Fig. 5.1. Addresses of the accessed elements of array A along with the
2-dimensional view for the case p = 3, k = 4 and s = 11.



6. Optimization of Loop Enumeration: GO-LEFT and 
GO-RIGHT 

A closer look at Figure 5.1 reveals that even if we generated code that
enumerates the points belonging to a family of parallel lines along the vector
$(l_1, l_2)$ by moving from one parallel line to the next along the vector $(r_1, r_2)$,
we would still access the elements in lexicographic order. Clearly, in this
example, the above enumeration turns out to be more efficient than the earlier
enumeration. We refer to this new method of enumeration as GO-LEFT, as
we enumerate all points on a line along the vector $(l_1, l_2)$ before we move
to the next line along the other basis vector. For the same reasons, we refer
to the earlier method of enumeration as GO-RIGHT. Next we show that the
GO-LEFT method also enumerates elements in lexicographic order. If $B$ (as
shown in Section 5) is the basis matrix for the GO-RIGHT case, then the basis
matrix for the GO-LEFT case is

$$B_L = \begin{bmatrix} r_1 & l_1 \\ r_2 & l_2 \end{bmatrix}.$$

Hence the transformation matrix in the GO-LEFT case is $B_L^{-1}$. Next, we show
that for the pair of basis vectors obtained using the algorithm shown in
Figure 4.5, the GO-LEFT scheme is always legal.

Theorem 61. Given a point $q$ belonging to the lattice $\Gamma_m$ and a pair of
extremal basis vectors l and r obtained using the algorithm in Figure 4.5,
applying $B_L^{-1}$ as a transformation maintains the access order.

Proof. Since l and r are extremal basis vectors,

$$q = z + v_1 l + v_2 r$$

where $z$ is the starting point of $\Gamma_m$ and $v_1$ and $v_2$ are non-negative integers. So
$q^{next}$ can be either $q + r$ or $q + l$ or $q + l + r$.

Let us assume that $q + r \in \Gamma_m$ and $q + l \in \Gamma_m$. This implies that
$q + r + l \in \Gamma_m$. With this assumption we can have the following two cases:

Case 1: $q + r$ is lexicographically closer to $q$ than $q + l$.

Case 2: $q + l$ is lexicographically closer to $q$ than $q + r$.

In Case 1, $q + l$ is lexicographically closer to $q$ than $q + r$. So it should be
clear that $q + r$ should be lexicographically closer to $q + l$ than to $q$, which is
impossible (due to the construction of r and l). Hence our assumption that
$q + r \in \Gamma_m$ and $q + l \in \Gamma_m$ is not true. A similar argument can be used to
show that our initial assumption is incorrect for Case 2 also.

From the above arguments, we conclude that given the starting point of
$\Gamma_m$, we maintain the lexicographic order of the points accessed by repeatedly
adding l until we run out of $\Gamma_m$, then adding r and continuing to add l until
we run out of $\Gamma_m$ again, and so on. So the access order does not change on
using $B_L$ as the basis matrix, i.e., applying $B_L^{-1}$ as the transformation.



From the above theorem it is clear that GO-LEFT is always legal for the pair
of basis vectors obtained using the algorithm shown in Figure 4.5. The loop
nest for $r_1 > 0$ is:

   DO u = ⌈mk l1 / s⌉, ⌊((s/d - 1)(-l2) + (mk + k - 1) l1) / s⌋
      DO v = max(⌈(mk + k - 1 - u r2)/l2⌉, ⌈-u r1/l1⌉), min(⌊(mk - u r2)/l2⌋, ⌊(s/d - 1 - u r1)/l1⌋)
         A2D[r1 u + l1 v, r2 u + l2 v] = ···
      ENDDO
   ENDDO

The node code for processor $m$ if $r_1 = 0$ is:

   DO u = ⌈mk l1 / s⌉, ⌊((s/d - 1)(-l2) + (mk + k - 1) l1) / s⌋
      DO v = max(⌈(mk + k - 1 - u r2)/l2⌉, 0), min(⌊(mk - u r2)/l2⌋, ⌊(s/d - 1)/l1⌋)
         A2D[l1 v, r2 u + l2 v] = ···
      ENDDO
   ENDDO




Next, we need to decide when it is beneficial to use GO-LEFT. The amount
of work that needs to be done to evaluate the inner loop bounds is the same
for each outer loop iteration in both the enumeration methods. So an
enumeration that results in fewer outer loop iterations is the scheme of choice.
The number of elements accessed per line in the two cases is a function of
the block size $k$ and the second components of the basis vectors. If $r_2 \le -l_2$, we
use GO-RIGHT; else, we use GO-LEFT.



Example 61. Code generated for the case where $\ell = 0$, $p = 3$, $k = 4$ and
$s = 11$ for processor 1 when we choose to GO-LEFT.
Input: start, end, r = (r1, r2), l = (l1, l2).

Output: The address sequence.

Method:

if $r_2 \le -l_2$ then /* GO-RIGHT */

1. $u_{start} = \dfrac{r_2 (start \;\mathrm{div}\; pk - \ell \;\mathrm{div}\; pk) - r_1 (start \bmod pk - \ell \bmod pk)}{s}$

2. $u_{end} = \dfrac{r_2 (end \;\mathrm{div}\; pk - \ell \;\mathrm{div}\; pk) - r_1 (end \bmod pk - \ell \bmod pk)}{s}$

3. Scan all the elements on the first line ($u_{start}$), starting at the start
element and then adding r until there are no more elements on this
processor.

4. From the previous start add l and then add r as many times as necessary
till you get back onto the processor space. The element thus obtained is
the start for the new line. Starting at this element keep adding r until
you run out of the processor space. Repeat this until the line immediately
before the last line ($u_{end}$).

5. Obtain the start point on the last line as before. Scan all the elements
along the line from the start by adding r until you reach the end element.

else /* GO-LEFT */

1. $u_{start} = \dfrac{-l_2 (start \;\mathrm{div}\; pk - \ell \;\mathrm{div}\; pk) + l_1 (start \bmod pk - \ell \bmod pk)}{s}$

2. $u_{end} = \dfrac{-l_2 (end \;\mathrm{div}\; pk - \ell \;\mathrm{div}\; pk) + l_1 (end \bmod pk - \ell \bmod pk)}{s}$

3. Scan all the elements on the first line ($u_{start}$), starting at the start
element and then adding l until there are no more elements on this processor.

4. From the previous start add r and then add l as many times as necessary
till you get back onto the processor space. The element thus obtained is
the start for the new line. Starting at this element keep adding l until
you run out of the processor space. Repeat this until the line immediately
before the last line ($u_{end}$).

5. Obtain the start point on the last line as before. Scan all the elements
along the line from the start by adding l until you reach the end element.

endif

Fig. 6.1. Algorithm for GO-LEFT and GO-RIGHT
The basis vectors obtained by running through the algorithms shown in
Figure 4.5 are l = (1, -1) and r = (8, 3). Hence the resulting node code for
processor 1 is:

   DO u = ⌈4/11⌉, ⌊17/11⌋
      DO v = max(3u - 7, -8u), min(3u - 4, 10 - 8u)
         A2D[8u + v, 3u - v] = ···
      ENDDO
   ENDDO

Here we observe that unlike the previous example, we scan all the elements
along a single line rather than 4 different lines. Clearly in this case going left
is the better choice.
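The GO-LEFT nest can be sketched in Python in the same way as the GO-RIGHT nest. This is an illustration with hypothetical function names, again assuming a lower bound of 0 for the section; the bounds are those obtained by applying Fourier-Motzkin elimination with the columns of the basis matrix swapped. Because $l_2 < 0$, dividing by $l_2$ reverses the inequalities, which is why the roles of $mk$ and $mk + k - 1$ are exchanged relative to GO-RIGHT.

```python
from math import gcd


def ceil_div(a, b):
    """Ceiling of a / b for integers (b may be negative, must be non-zero)."""
    return -((-a) // b)


def go_left(p, k, s, m, l_vec, r_vec):
    """Return one list of 2D addresses per outer-loop iteration (per line)."""
    (l1, l2), (r1, r2) = l_vec, r_vec
    mk = m * k
    c = s // gcd(s, p * k) - 1                     # s/d - 1
    u_lo = ceil_div(mk * l1, s)
    u_hi = (c * (-l2) + (mk + k - 1) * l1) // s
    lines = []
    for u in range(u_lo, u_hi + 1):
        v_lo = max(ceil_div(mk + k - 1 - u * r2, l2), ceil_div(-u * r1, l1))
        v_hi = min((mk - u * r2) // l2, (c - u * r1) // l1)
        lines.append([(r1 * u + l1 * v, r2 * u + l2 * v)
                      for v in range(v_lo, v_hi + 1)])
    return lines
```

For the example above, `go_left(3, 4, 11, 1, (1, -1), (8, 3))` produces a single line containing all four accessed points, whereas the GO-RIGHT nest of Example 51 visits five values of u, one of them empty.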



6.1 Implementation 

We observe from the example in Sections 5 and 6 that the code generated for
GO-RIGHT enumerates the points that belong to a family of parallel lines, i.e.,
along the vector r, by moving from one parallel line to the next within the
family along the vector l, and the code generated for GO-LEFT enumerates the
points that belong to a family of parallel lines along the vector l, by moving
from one parallel line to the next within the family along the vector r. So in
the code derived in Section 5, the outer loop iterates over the set of parallel
lines while the inner loop iterates over all the elements accessed in each line
on a given processor.

From the previous example it can be seen that we may scan a few empty
lines (i.e., lines on which no element is accessed) in the beginning and the
end. This can be avoided by evaluating tighter bounds for the outer
loop using the start and end elements evaluated in the algorithm shown in
Figure 4.5. The start line $u_{start}$ and end line $u_{end}$ can be evaluated as
follows (using the GO-RIGHT enumeration scheme):

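$l_1 u_{start} + r_1 v_{start} = start \,\mathrm{div}\, pk - \ell \,\mathrm{div}\, pk$
$l_2 u_{start} + r_2 v_{start} = start \bmod pk - \ell \bmod pk$
$l_1 u_{end} + r_1 v_{end} = end \,\mathrm{div}\, pk - \ell \,\mathrm{div}\, pk$
$l_2 u_{end} + r_2 v_{end} = end \bmod pk - \ell \bmod pk$

Hence,

$u_{start} = \dfrac{r_2\,(start \,\mathrm{div}\, pk - \ell \,\mathrm{div}\, pk) - r_1\,(start \bmod pk - \ell \bmod pk)}{s}$,

$u_{end} = \dfrac{r_2\,(end \,\mathrm{div}\, pk - \ell \,\mathrm{div}\, pk) - r_1\,(end \bmod pk - \ell \bmod pk)}{s}$.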

The inner loop of the node code evaluates the start element for each
iteration of the outer loop, i.e., each line traversed. In our implementation
of the loop enumeration we use the start element of the previous line
traversed to obtain the start element of the next line. This eliminates the
expensive integer divisions involved in evaluating the start elements on the
different lines. Figure 6.1 shows our algorithm for loop enumeration.
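
The start-line and end-line formulas can be sketched in Python as follows. This is a minimal illustration that assumes the running example (l = (1, −1), r = (8, 3), s = 11, ℓ = 0) and a course length pk = 12, which is inferred rather than stated. The function name `line_index` is hypothetical.

```python
def line_index(x, l, r, s, pk, ell=0, go_left=False):
    """Index of the parallel line through global element x.

    GO-RIGHT lines run along r, GO-LEFT lines run along l; the division
    by s is exact because x lies on the access lattice.
    """
    a = x // pk - ell // pk          # course difference
    b = x % pk - ell % pk            # offset difference
    if go_left:
        num = -l[1] * a + l[0] * b
    else:
        num = r[1] * a - r[0] * b
    assert num % s == 0, "x is not a lattice point"
    return num // s


L_VEC, R_VEC, S, PK = (1, -1), (8, 3), 11, 12    # pk = 12 is an assumption
start, end = 55, 88                              # first and last element on processor 1

print(line_index(start, L_VEC, R_VEC, S, PK), line_index(end, L_VEC, R_VEC, S, PK))
print(line_index(start, L_VEC, R_VEC, S, PK, go_left=True),
      line_index(end, L_VEC, R_VEC, S, PK, go_left=True))
```

With these values GO-RIGHT spans lines −4 through −1 (four lines), whereas GO-LEFT starts and ends on line 1, which agrees with the comparison made for the example.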



7. Experimental Results for One-Level Mapping 

We performed experiments on our pattern generation algorithm on a Sun
Sparcstation 20. We used the cc compiler with the -fast optimization
switch; the function gettimeofday() was used to measure time. When
computing the time for 32 processors, we timed the code that computes the
access pattern for each processor, and report the maximum time over all
processors. We experimented with block sizes in powers of 2 ranging from 4 to
1024 for 32 processors. The total times for the two different implementations
(“Right” and “Zigzag”) of the algorithm in [16] and our algorithm include basis
and table generation times. Tables 7.1(a)-7.1(c) show the total times for the
above three algorithms and the total time for pattern generation for the
algorithm proposed by Chatterjee et al. [5] (referred to as “Sort” in the
tables). For very small block sizes, all the methods have comparable
performance. At block sizes from 16 onward, our solution outperforms the other
three. For higher block sizes, our pattern generation algorithm performs 2 to 9
times faster than the two Rice [16] algorithms. For larger block sizes, if
s < k, our algorithm is 7 to 9 times faster than the Rice algorithms because of
the need to find only s + 1 lattice points, instead of k lattice points, in
order to find the basis vectors. In addition, for larger block sizes,
experiments indicate that address enumeration time (given the basis vectors)
for our algorithm is less than that of [16]. From our choice of enumeration, we
decide to use GO-LEFT for s = pk − 1 and GO-RIGHT for s = pk + 1. Since the
algorithms in [16] do not exploit this enumeration choice, our algorithm
performs significantly better. In addition, our algorithm is 13 to 65 times
faster than the approach of Chatterjee et al. [5] for large block sizes.

In addition to the total time, we examined the basis determination time 
and the access enumeration times separately. In general, the basis determina- 
tion time accounts for 75% of the total address generation time and is about 3 
times the actual enumeration time. The basis determination times are shown 
in Figure 7.1 and the enumeration times are shown in Figure 7.2. In these 
figures, we plot the times taken for our algorithm (“Loop”), and the best 
of the times for the two Rice implementations. Figure 7.1(a) shows that for 
s = 7, the basis generation time for Loop is practically constant while that 
for Rice increases with block size, k; for k = 2048, the basis generation time 
for Rice is about 50 times that of our algorithm. In Figure 7.1(b) (s = 99), 
it is clear that the basis generation time for our algorithm is nearly constant 
while that for Rice increases from k = 128 onward; this is because of the fact 
that our basis generation algorithm has a complexity O(min(s, k)) while
Table 7.1. Total address generation times (in μs) for our technique (Loop), Right
and Zigzag of Rice [16] and the Sort approach [5] on a Sun Sparcstation 10

(a) p = 32; s = 3 and s = 5

Block Size |           s = 3            |           s = 5
    k      | Loop  Right  Zigzag  Sort  | Loop  Right  Zigzag  Sort
-----------+----------------------------+----------------------------
      2    |   17     19      19    20  |   19     21      21    22
      4    |   20     23      23    27  |   29     31      31    35
      8    |   21     24      25    37  |   21     24      25    37
     16    |   29     33      33    69  |   25     33      34    69
     32    |   29     44      45   150  |   31     46      47   152
     64    |   33     73      76   453  |   42     82      84   460
    128    |   36    127     132   845  |   37    126     131   843
    256    |   47    238     247  1638  |   48    236     246  1638
    512    |   64    458     480  3213  |   67    457     472  3221
   1024    |  104    902     936  6383  |  113    905     946  6384

(b) p = 32; s = 7 and s = 9

Block Size |           s = 7            |           s = 9
    k      | Loop  Right  Zigzag  Sort  | Loop  Right  Zigzag  Sort
-----------+----------------------------+----------------------------
      2    |   17     19      19    20  |   17     19      19    20
      4    |   21     23      23    27  |   21     23      23    27
      8    |   31     34      35    47  |   23     26      27    39
     16    |   25     31      32    67  |   26     33      35    69
     32    |   32     46      47   152  |   42     55      57   160
     64    |   43     81      84   460  |   44     81      84   460
    128    |   38    125     131   843  |   39    125     131   843
    256    |   49    236     245  1638  |   50    236     245  1637
    512    |   76    464     487  3222  |   69    454     470  3220
   1024    |  105    891     938  6368  |  107    890     919  6392

(c) p = 32; s = 11 and s = 99

Block Size |           s = 11           |           s = 99
    k      | Loop  Right  Zigzag  Sort  | Loop  Right  Zigzag  Sort
-----------+----------------------------+----------------------------
      2    |   27     29      29    30  |   35     37      37    38
      4    |   37     39      39    43  |   45     47      47    51
      8    |   31     34      35    47  |   55     59      59    71
     16    |   35     41      42    77  |   51     58      58    93
     32    |   32     44      47   151  |   50     64      64   170
     64    |   37     73      75   452  |   71     91      98   469
    128    |   50    135     141   854  |   97    149     152   865
    256    |   67    252     261  1654  |  151    256     276  1656
    512    |   70    455     469  3221  |  114    453     470  3224
   1024    |  108    890     918  6389  |  154    887     916  6392

[Figure 7.2 panels: (a) s = 7, p = 32; (b) s = 99, p = 32; (c) s = k − 1, p = 32;
(d) s = k + 1, p = 32; (e) s = pk − 1, p = 32; (f) s = pk + 1, p = 32]

Fig. 7.2. Lattice enumeration times (given the basis vectors) for p
processors for various block sizes and strides

Rice has a complexity O(k). Figures 7.1(c)-(f) indicate that for large
values of k, the Loop basis generation time is about half that of the Rice
basis generation time. Figures 7.2(a)-(f) show that for small block sizes, the
enumeration times for Loop and Rice are comparable, and that from k = 64
onwards, the enumeration time for Loop is lower than that of Rice. From these
figures, it is clear that the Loop algorithm outperforms the Rice algorithm
in both the basis determination and address enumeration phases of address
generation.



Table 7.2. Total address generation times (in μs) for our technique (Loop), Right
and Zigzag of Rice [16] and the Sort approach [5] on a Sun Sparcstation 10

(a) p = 32; s = k − 1 and s = k + 1

Block Size |         s = k − 1          |         s = k + 1
    k      | Loop  Right  Zigzag  Sort  | Loop  Right  Zigzag  Sort
-----------+----------------------------+----------------------------
      2    |   17     19      19    20  |   19     21      21    22
      4    |   20     23      23    27  |   29     31      31    35
      8    |   31     34      35    47  |   23     26      27    39
     16    |   26     33      33    69  |   26     33      33    69
     32    |   33     45      45   154  |   33     45      45   155
     64    |   64     84      87   459  |   64     82      90   459
    128    |   92    141     141   853  |   93    134     144   853
    256    |  149    255     255  1642  |  150    240     255  1641
    512    |  263    486     485  3219  |  263    459     486  3219
   1024    |  491    946     944  6375  |  493    894     946  6375

(b) p = 32; s = pk − 1 and s = pk + 1

Block Size |         s = pk − 1         |         s = pk + 1
    k      | Loop  Right  Zigzag  Sort  | Loop  Right  Zigzag  Sort
-----------+----------------------------+----------------------------
      2    |   17     19      19    20  |   15     17      17    18
      4    |   18     21      21    25  |   16     19      19    23
      8    |   19     24      24    37  |   17     22      23    35
     16    |   23     31      31    67  |   20     29      30    65
     32    |   29     45      45   155  |   26     43      44   147
     64    |   43     73      72   449  |   38     72      74   449
    128    |   70    128     128   843  |   62    129     132   844
    256    |  124    239     239  1631  |  109    243     250  1637
    512    |  232    463     463  3209  |  204    474     486  3220
   1024    |  449    909     909  6365  |  395    932     958  6388



8. Address Sequence Generation for Two-Level Mapping 

Non-unit alignment strides render address sequence generation even more
difficult since the addresses cannot directly be represented as a lattice; in
this case, the addresses can be thought of as the composition of two integer
lattices. This section presents a solution to the problem of address generation
for such a case when the data objects are mapped to processor memories using a
CYCLIC(k) distribution. We present several methods of generating the address
sequence. Our approach involves the construction of pattern tables, which does
not incur the runtime overheads of other existing solutions for this
problem. We use two applications of the method described in the preceding
sections to generate the pattern of accesses.

8.1 Problem Statement 

Consider the following HPF code:

      REAL A(N)
!HPF$ TEMPLATE T(a*N + b)
!HPF$ PROCESSORS PROCS(p)
!HPF$ ALIGN A(j) WITH T(a*j + b)
!HPF$ DISTRIBUTE T(CYCLIC(k)) ONTO PROCS
      do i = 0, ⌊(h − ℓ)/s⌋
         A(ℓ + i*s) = ...
      enddo

A compiler that generates the node code for the above HPF program has 
to generate the set of local elements of array A accessed on processor m. 

To recall, when the alignment stride a > 1, the mapping is called a two-level
mapping. A non-unit-alignment-stride mapping results in many template
cells that do not have any array elements aligned to them. These empty template
cells are referred to as holes. We need not allocate memory for holes in the
local address space during mapping. The challenge then is to generate the 
sequence of accessed elements in this local address space ensuring that no 
storage is wasted on holes. 

Figure 8.1(a) shows the distribution of the template cells onto a processor
arrangement. For this example, the alignment stride a and access stride
s are both equal to 3. The number of processors p is 4 and we assume a
CYCLIC(4) distribution; this example is from Chatterjee et al. [5]. Now, we
define a few terms used here. The set of global indices of array elements
that are aligned to some template cell on a processor is called the set of
allocated elements. The set of global indices of accessed array elements that
are aligned to some template cell on that processor is called the set of
accessed elements. The accessed elements are therefore a subset of the
allocated elements. For the given example, {0, 1, 6, 11, 16, ...} is the set of
allocated elements and {0, 6, 27, 33, 48, ...} is the set of accessed elements
for processor 0. Figure 8.1(b) shows the local address space of template cells
on all the processors. The problem of deriving the accessed elements for this
template space is similar to a one-level mapping problem where the stride s is
replaced with a·s. However, using this method we incur huge memory wastage and
suffer from poor data locality, resulting in higher execution times.

[Figure 8.1 panels: (a) Global layout of template cells on p = 4 processors.
(b) Local memory layout for template cells. (c) Local memory layout for array
cells.]

Fig. 8.1. Two-level mapping of array A when a = 3, s = 3, p = 4, k = 4 and
ℓ = 0

INDEX        0    1    2    3    4    5    6    7    8    9   10   11
ALLOCATED   [0]   3   [6]   9   12   15   18  [21]  24  [27]  30   33    (a)

ACCESSED     0    6   21   27                                           (b)

PATTERN      0    2    7    9                                           (c)

(Bracketed entries in row (a) are the boxed, i.e., accessed, elements.)

Fig. 9.1. Local addresses of accessed and allocated elements along with two-
level access pattern when a = 3, s = 3, p = 4, k = 4 and ℓ = 0

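
The definitions above can be made concrete with a brute-force Python sketch. It assumes the example parameters (a = 3, s = 3, p = 4, k = 4, ℓ = 0, b = 0) and an array bound of 50 chosen only for illustration; the helper names are hypothetical. A compiler would use the lattice-based algorithms rather than this scan.

```python
def owner(cell, p, k):
    """Processor that owns a template cell under CYCLIC(k) on p processors."""
    return (cell // k) % p


def local_template_address(cell, p, k):
    """Local address of a template cell when holes are NOT removed."""
    return (cell // (p * k)) * k + cell % k


def allocated(n, a, b, p, k, m):
    """Global indices j < n of array elements stored on processor m."""
    return [j for j in range(n) if owner(a * j + b, p, k) == m]


def accessed(n, a, b, p, k, m, ell, s):
    """Global indices j < n touched by A(ell + i*s) on processor m."""
    return [j for j in range(ell, n, s) if owner(a * j + b, p, k) == m]


a, s, p, k, m, b, ell, n = 3, 3, 4, 4, 0, 0, 0, 50   # n is an arbitrary bound
alloc = allocated(n, a, b, p, k, m)
acc = accessed(n, a, b, p, k, m, ell, s)
print(alloc[:5])                                      # [0, 1, 6, 11, 16]
print(acc)                                            # [0, 6, 27, 33, 48]
print([local_template_address(a * j + b, p, k) for j in acc])
```

The last line prints the addresses in the uncompressed template space, where two of every three local cells are holes for this example.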

If one eliminates holes in this layout, one obtains significant savings in
memory usage. This can be achieved by viewing the local address space as
shown in Figure 8.1(c). This local address space does not have any memory
wastage. However, additional work has to be done at address generation time
to switch from the template space to the local space. Due to the absence of
holes we can expect improved data locality and thus faster execution times.
The address generation problem now is to generate the set of elements
accessed in this local address space efficiently at runtime.



9. Algorithms for Two-Level Mapping 

The algorithms proposed in this section solve the problem of generating ad- 
dresses for a compressed space for two-level mapping. These algorithms ex- 
ploit the repetitive pattern of accesses by constructing pattern tables for the 
local address space. These pattern tables are then used to generate the com- 
plete set of accesses for the array layout just like in the case of one-level 
mapping. 

The main idea behind these algorithms is to construct tables that store
the indices needed to switch from the template space to the local space.
Since we do not allocate memory for holes, we have no memory wastage.
We also do not incur high costs for generating the access function that
switches from the template space to the local address space, which leads to
faster execution times. This, coupled with the fact that no memory is wasted,
makes these methods superior to other existing methods that access array
elements lexicographically.

Fig. 9.2. Two dimensional coordinates of allocated and accessed elements
along with itable and two-level access pattern when a = 3, s = 3, p = 4,
k = 4 and ℓ = 0



The algorithms for two-level mapping discussed in this chapter can be
broadly classified into two groups. These algorithms differ mainly in the
manner in which the tables are constructed in order to switch from the template
space to the local address space. The first algorithm constructs a table of
offsets, whereas the algorithms in the second group use different search
techniques to locate accessed elements in the set of allocated elements in the
compressed space.

All these algorithms first view the address space as an integer lattice and
use basis vectors to generate the access sequence of both allocated and
accessed elements. The basis vectors are generated using our one-level address
generation algorithm discussed earlier. Two applications of the one-level
algorithm, with input strides a and a·s respectively, generate the set
of accesses for both allocated and accessed elements. Figure 9.1(a) shows the
first repetition pattern of local addresses of the set of accesses for
allocated elements. The numbers in boxes are the set of elements accessed, and
the pattern of repetition of these elements is shown in Figure 9.1(b). In a
compressed space we need to locate the position of these accessed elements
in the set of allocated elements. So we record the positions of these entries
in a separate table as shown in Figure 9.1(c). The main objective of these
algorithms is to generate this switching table that helps in switching from
the template space to the local compressed space. The construction of these
switching tables is discussed in the following sections.



9.1 itable: An Algorithm That Constructs a Table of Offsets

The main idea behind this algorithm is to construct a table of offsets, which is
used to help switch from the non-compressed space to the compressed space.
The algorithm exploits the repetition of accesses of both allocated and
accessed elements. This algorithm first generates a two dimensional view of the
set of accesses of both allocated and accessed elements for the non-compressed

Input: Layout parameters (p, k), loop limits (ℓ, h), access stride s,
       alignment stride a for array A, processor m
Output: Two_Level
Method:

 1  d1 ← gcd(a, pk)
 2  d2 ← gcd(a * s, pk)
 3  (length_a, Y1, Y2) ← one_level(p, k, a, m, d1)
 4  (length_as, X1, X2) ← one_level(p, k, a * s, m, d2)
 5  for i = 0, length_a − 1 do
 6      itable[Y2[i]] ← i
 7  enddo
 8  a_cou ← a / d1
 9  for i = 0, length_as − 1 do
10      Two_Level[i] ← ⌊X1[i] / a_cou⌋ * length_a + itable[X2[i]]
11  enddo
12  return Two_Level

Fig. 9.3. Algorithm that constructs the itable for determining the two-level
access pattern table



space. This is done by the application of the one-level algorithm with input
strides a and a·s respectively. Recording the two-dimensional view of
these sets does not incur any extra overhead such as expensive division and
modular operations due to the way we generate the set of accesses using the
one-level algorithm.

Figures 9.2(a) and (b) list the two-dimensional coordinates of the allocated
and accessed elements for the first set of pattern repetition for the
example discussed previously. The second coordinates of both these sets
indicate the offsets of the elements from the beginning of each course. A quick
glance clearly indicates that the accessed elements are a subset of the
allocated elements. Using this information the algorithm first builds a table
of offsets called the itable for the first repetition pattern of allocated
elements. The allocated access pattern repeats itself after every
a/gcd(a, pk) courses. This table records the order in which the offsets of
allocated elements are accessed in lexicographic order.

The next stage involves using this table to determine the location of the
accessed element in the compressed space. The problem now is to find two
things. Firstly, we need to determine the repetition block in which the accessed
element is located. Secondly, we need to find its position among the set of
allocated elements in that particular repetition block. Finding the repetition
block in which the element is located is straightforward, as we know the
number of courses after which the set of allocated elements repeats and the
length of this set. In order to find the position of the element in the list of
allocated elements in a particular repetition block we need to index into the
itable, which gives the position of the accessed element based on its offset
from the beginning of the course. Hence by finding the repetition block in
which the element exists and the position of the element in that block we can
determine the local address of the element.

A detailed listing of the algorithm is shown in Figure 9.3. Lines
1-4 generate the pattern tables for strides a and a·s. These
tables record the two dimensional indices of elements accessed. Y1 and Y2
hold the two dimensional coordinates of the allocated elements while X1 and
X2 hold the two dimensional coordinates for the accessed elements. Lines
5-7 construct the itable that records the positions of offsets of allocated
elements accessed in lexicographic order. The length of this table is always
k. Lines 9-11 generate the two-level pattern table. For each element in the
accessed element set, a corresponding entry in the itable will help determine
the location of this element in the allocated set.

Let us consider the example in Figure 8.1. We see that elements 0, 1,
6, 11 have offsets 0, 3, 2, 1 respectively from the beginning of the course.
The two-dimensional coordinates for both allocated and accessed elements
are as shown in Figures 9.2(a) and (b). Based on the entries in the Y2 table,
the itable is constructed and is as shown in Figure 9.2(c). In this case the
second coordinates of the allocated elements are the same as the entries of the
itable, but in general the entries in the itable depend on the value of
gcd(a, pk). The itable is always of size k, as there could be a maximum of k
elements for each pattern of repetition. In order to construct the Two_Level
pattern table, let us consider the third entry (5, 1) from the table of
accessed elements as shown in Figure 9.2(b). This means that the accessed
element lies in course number 5 and hence falls in the second repetition block
of allocated elements. The value 1 in the second coordinate corresponds to the
offset of the accessed element from the beginning of the course. This serves as
an index into the itable. Hence the value at position 1 of the itable will
yield 3 as shown in Figure 9.2(c). This value gives the position of the
accessed element in that particular repetition block. Since we know the number
of elements present in a single block (which corresponds to 4 in this
example), we can simply evaluate 4 · 1 + 3 = 7, which gives us the position of
the third element accessed in the compressed local space. Figure 9.2(d) shows
the positions of accessed elements among the set of allocated elements for the
first set of pattern repetition. Next, we discuss some improvements to the
algorithm that constructs the itable.
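
The worked example can be reproduced with the following Python sketch of the itable method. The function `one_level` here is a brute-force stand-in for the one-level algorithm, not the lattice-based version: it assumes ℓ = 0 and b = 0 and omits the gcd argument that the algorithm listing passes in. All names are illustrative.

```python
from math import gcd


def one_level(p, k, stride, m):
    """Brute-force stand-in: (course, offset) pairs of the first repetition
    pattern for the given stride on processor m, in increasing order."""
    pk = p * k
    courses = stride // gcd(stride, pk)      # pattern repeats after this many
    c1, c2 = [], []
    for x in range(0, courses * pk, stride):
        off = x % pk - m * k
        if 0 <= off < k:
            c1.append(x // pk)
            c2.append(off)
    return len(c1), c1, c2


def two_level_itable(p, k, a, s, m):
    """Positions of accessed elements in the compressed local space."""
    length_a, _y1, y2 = one_level(p, k, a, m)
    length_as, x1, x2 = one_level(p, k, a * s, m)
    itable = [None] * k                      # offset -> rank among allocated
    for i in range(length_a):
        itable[y2[i]] = i
    a_cou = a // gcd(a, p * k)               # courses per allocated repetition
    return [(x1[i] // a_cou) * length_a + itable[x2[i]]
            for i in range(length_as)]


print(one_level(4, 4, 3, 0))          # (4, [0, 0, 1, 2], [0, 3, 2, 1])
print(one_level(4, 4, 9, 0))          # (4, [0, 1, 5, 6], [0, 2, 1, 3])
print(two_level_itable(4, 4, 3, 3, 0))  # [0, 2, 7, 9]
```

The third entry is computed as (5 // 3) · 4 + itable[1] = 4 + 3 = 7, matching the calculation in the text.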

9.2 Optimization of the itable Method

As can be seen from the algorithm in Figure 9.3, line 10 that computes the
Two_Level pattern table includes expensive integer operations, an integer
multiply and an integer divide. Here, we explore the possibility of reducing the
number of these expensive operations in the itable algorithm. The key point
to note is that in the expression ⌊X1[i] / a_cou⌋ * length_a, both the
quantities a_cou and length_a are loop invariant constants. We improve the
performance here by

Input: Layout parameters (p, k), loop limits (ℓ, h), access stride s,
       alignment stride a for array A, processor m
Output: Two_Level
Method:

 1  d1 ← gcd(a, pk)
 2  d2 ← gcd(a * s, pk)
 3  (length_a, Y1, Y2) ← one_level(p, k, a, m, d1)
 4  (length_as, X1, X2) ← one_level(p, k, a * s, m, d2)
 5  a_cou ← a / d1
 6  first ← X1[0]
 7  tmp1 ← ⌊first / a_cou⌋ * length_a
 8  tmp2 ← first mod a_cou
 9  for i = tmp2 to a_cou − 1 do
10      lookup_acc1[first] ← tmp1
11      first ← first + 1
12  enddo
13  tmp1 ← tmp1 + length_a
14  last ← X1[length_as − 1]
15  while (first ≤ last) do
16      i ← 0
17      while (i < a_cou and first ≤ last) do
18          lookup_acc1[first] ← tmp1
19          i ← i + 1
20          first ← first + 1
21      enddo
22      tmp1 ← tmp1 + length_a
23  enddo
24  for i = 0 to length_a − 1 do
25      itable[Y2[i]] ← i
26  enddo
27  for i = 0 to length_as − 1 do
28      Two_Level[i] ← lookup_acc1[X1[i]] + itable[X2[i]]
29  enddo
30  return Two_Level

Fig. 9.4. itable*, a faster algorithm to compute the itable for determining
the two-level access pattern by substituting integer divides with table lookups

Fig. 9.5. Local addresses of allocated and accessed elements, the replicated
allocated table, along with two-level access pattern when a = 3, s = 3, p = 4,
k = 4 and ℓ = 0



Input: Layout parameters (p, k), loop limits (ℓ, h), access stride s,
       alignment stride a for array A, processor m
Output: Two_Level
Method:

 1  d1 ← gcd(a, pk)
 2  d2 ← gcd(a * s, pk)
 3  (length_a, pattern_a) ← one_level(p, k, a, m, d1)
 4  (length_as, pattern_as) ← one_level(p, k, a * s, m, d2)
 5  Factor of replication f ← (a s k / d2) / (a k / d1)
 6  Replicate pattern_a by factor f
 7  i ← 0; j ← 0
 8  while (j < length_as) do
 9      if (pattern_as[j] = pattern_a[i]) then
10          Two_Level[j] ← i
11          j ← j + 1
12      endif
13      i ← i + 1
14  enddo
15  return Two_Level

Fig. 9.6. Algorithm that constructs a two-level access pattern table using
linear search method
using table lookups. Instead of length_as divisions we need only one division
and one mod operation; these are needed to compute just the first entry
⌊X1[0] / a_cou⌋ * length_a, and the rest can be calculated by exploiting the
properties of numbers. This optimization of the itable method is shown in
Figure 9.4.
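
A minimal Python sketch of the table-lookup idea follows. It takes the coordinate tables of the running example as given (X1 = [0, 1, 5, 6], X2 = [0, 2, 1, 3], Y2 = [0, 3, 2, 1], a_cou = 3, k = 4) and uses a dictionary keyed by course number in place of the lookup array; that choice and the function name are assumptions made for brevity.

```python
def two_level_lookup(x1, x2, y2, a_cou, k):
    """Build the two-level pattern using one division and one mod in total."""
    length_a = len(y2)
    itable = [None] * k
    for i, off in enumerate(y2):
        itable[off] = i

    lookup = {}                              # course -> block index * length_a
    first, last = x1[0], x1[-1]
    tmp1 = (first // a_cou) * length_a       # the only integer division
    tmp2 = first % a_cou                     # the only mod operation
    for _ in range(tmp2, a_cou):             # finish the first, partial block
        lookup[first] = tmp1
        first += 1
    tmp1 += length_a
    while first <= last:                     # remaining blocks: additions only
        i = 0
        while i < a_cou and first <= last:
            lookup[first] = tmp1
            i += 1
            first += 1
        tmp1 += length_a

    return [lookup[c] + itable[o] for c, o in zip(x1, x2)]


print(two_level_lookup([0, 1, 5, 6], [0, 2, 1, 3], [0, 3, 2, 1], 3, 4))
# [0, 2, 7, 9]
```

Each course between the first and last accessed course receives its base address by repeated addition of length_a, so the per-element work in the final loop is two table lookups and one add.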







9.3 Search-Based Algorithms 

The key idea in determining the local addresses of accessed elements
is to find the location of accessed elements in a list of allocated elements
(expanded to accommodate the largest element in the accessed set), since
accessed elements are a subset of allocated elements. This can be achieved
by using a naive approach of simply searching for the index of the accessed
element in the list of allocated elements. In this section we propose new search
methods that exploit the property that the lists of elements are in sorted order.

The first step in performing these methods is to run the one-level
algorithm to obtain the local addresses of the set of accesses for the first
pattern of repetition for both allocated and accessed elements. These entries
are in lexicographic order and are assumed to be stored in the pattern_a and
pattern_as tables respectively. Note that unlike other techniques we do not use
the memory gap table here since a significant fraction of the work involved in
address generation for two-level mapping is in the recovery of the actual
elements from the memory gap table [5]. These tables are as shown in Figures
9.5(a) and (b) for the example in Figure 8.1. The entries in these tables
correspond to the local addresses in a non-compressed template space. The table
size of the former depends on k/gcd(a, pk) whereas that of the latter table
depends on k/gcd(as, pk).

Here we see that not all elements in the accessed set are present in the
allocated set for the first pattern of repetition. This is due to the fact that the
accessed elements lie in different repetition blocks of the local address space.
Hence we need to expand the set of allocated elements in order to represent
all the elements in the accessed table, before the pattern starts repeating.
The total number of elements in the first repetition block of the allocated
table in the uncompressed space is $\frac{ak}{\gcd(a,pk)}$, whereas the total
number of elements for the accessed table is $\frac{ask}{\gcd(as,pk)}$. Hence
the factor needed to expand the allocated elements table is
$\frac{ask}{\gcd(as,pk)} \Big/ \frac{ak}{\gcd(a,pk)} = \frac{s \cdot \gcd(a,pk)}{\gcd(as,pk)}$.
Performing the required expansion is straightforward. It involves replicating
the first set of allocated elements as many times as the factor of replication.
This is accomplished by copying elements one at a time from the first pattern
of allocated elements to the extended memory space with a suitable increment.
Another possibility is to replicate on demand.
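
The replication step can be sketched in Python as follows. The increment is taken to be the size of one allocated repetition block in the uncompressed local space, a·k/gcd(a, pk); this is a reading of the phrase "suitable increment" and is an assumption, as are the function names.

```python
from math import gcd


def replication_factor(a, s, p, k):
    """f = s * gcd(a, pk) / gcd(a*s, pk); the division is always exact."""
    pk = p * k
    return (s * gcd(a, pk)) // gcd(a * s, pk)


def replicate(pattern_a, a, s, p, k):
    """Extend the first allocated pattern to cover the accessed pattern."""
    block = a * k // gcd(a, p * k)           # local cells per repetition block
    f = replication_factor(a, s, p, k)
    return [addr + rep * block for rep in range(f) for addr in pattern_a]


print(replication_factor(3, 3, 4, 4))        # 3
print(replicate([0, 3, 6, 9], 3, 3, 4, 4))
# [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33]
```

For the running example the factor is 3 and the replicated table contains the twelve local addresses that appear in the ALLOCATED row of Figure 9.1.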

A search now has to be performed to locate the position of an accessed
element in this new replicated table. Figure 9.5(c) shows the replicated table
after expansion for the example discussed previously. The factor for
replication in this case was found to be 3. Since the length of the table that
holds the addresses of the accessed elements is never greater than the length
of the table that holds the allocated elements, the algorithm needs to find
the locations of common elements from two tables of different size. Several
search algorithms can be implemented for finding the locations of accessed
elements. We discuss an algorithm based on linear search. Several search
algorithms (with and without the need for replication of the set of elements)
that differ mainly in their complexities and the speed of execution can be
found elsewhere [9,31,42].

Linear Search. The algorithm for linear search builds the two-level pattern
table needed to switch from the template space to the local address space.
Figure 9.6 lists the complete algorithm. Lines 1-4 build the
accessed and allocated tables. The next step is to find the factor for
replication, as shown in Line 5. This factor is used to replicate the allocated
table. Once replication is performed, we need to perform a simple search in
order to locate the position of each accessed element in the allocated table.
Lines 8-14 show the search algorithm. For each entry in the accessed table,
it determines the location of this element in the replicated allocated set. As
and when the location is determined, the position is recorded in the Two_Level
pattern table. The entries in this table reflect the local addresses of the
accessed elements in the compressed space.

Figure 9.5 can be used to explain the functioning of this algorithm. Let us
consider the element 21 from the accessed set as shown in Figure 9.5(b). The
search involves finding the position of this element in the replicated set as
shown in Figure 9.5(c). This element can be found at location 7. This entry
is then stored in the Two_Level pattern table. The complete pattern table for
the example is as shown in Figure 9.5(d). Since the search is performed on a
sorted table of length f · length_a and no element of this table is accessed
more than once, the complexity of the algorithm is O(f · length_a). In addition
to the linear search method discussed above, one could use binary search. Also,
it is possible to avoid replicating the elements by generating them on demand
in the course of a search [9,31,42].
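
The linear search itself amounts to a single merge-style pass over two sorted tables. The Python sketch below uses the tables of the running example as literal inputs and relies on every accessed address being present in the replicated allocated table; the function name is hypothetical.

```python
def linear_search(pattern_as, replicated_a):
    """Record, for each accessed address, its index in the replicated
    allocated table. Both inputs must be sorted in increasing order."""
    two_level = []
    i = j = 0
    while j < len(pattern_as):
        if pattern_as[j] == replicated_a[i]:
            two_level.append(i)      # position in the compressed space
            j += 1
        i += 1                       # each allocated entry is visited once
    return two_level


accessed_addrs = [0, 6, 21, 27]
allocated_addrs = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33]
print(linear_search(accessed_addrs, allocated_addrs))   # [0, 2, 7, 9]
```

Because the index i never moves backwards, the pass touches at most f · length_a entries, which is the bound quoted above.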



10. Experimental Results for Two-Level Mapping 

In order to compare all the above mentioned methods, we ran our experiments
over a range of problem parameters. These experiments were done
on a Sun UltraSparc 1 workstation with Solaris 2. The compiler used was
the Sun C compiler cc with the -xO2 flag. Though the experiments were run
for a large set of input values, only a limited number of results and times
are shown here. In each case we report only the times needed to construct
the two-level table, excluding the times taken to construct the two one-level
pattern tables, as done in all the techniques discussed in this paper. We fixed
the number of processors to p = 32 in all our experiments. For each value of
alignment stride a, we varied both the block size k and the access stride s. The
optimized version of the algorithm that constructs the itable, i.e., the version
that replaces expensive divisions by table lookups (Figure 9.4) described in
Section 9.2, is referred to as itable*. The best of the search algorithms that
perform replication was chosen for the results and is referred to as search in
the tables. The search algorithm that does not perform replication is termed
norep in the results. The method due to Chatterjee et al. [5] is termed
riacs. Tables 10.1-11.1 show the time it takes to build the two-level pattern
tables.

Table 10.1. Table generation times (μs) for two-level mapping p = 32, a = 2

Table 10.2. Table generation times (μs) for two-level mapping p = 32, a = 3

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The results indicate that the times taken by all the above mentioned 
methods depend on the value of k, s and a. If the access and alignment 
strides are small, the liable* and the search techniques are competitive; this 
is because the time taken for replication and the overhead in performing a 
search is very minimal. But as s and k increases we notice that the search 
starts performing worse. This is because as s and k start increasing the time 
for replication in the search dominates over search and renders this method 
inefficient. The construction of the liable forms the major part of time taken 
for two-level pattern build up. This construction is purely a function of a and 
k and not of s. Hence as s increases the times for liable* does not vary widely. 
The method liable* performs the best over a wide range of parameters. 
Table 10.3. Table generation times (μs) for two-level mapping p = 32, a = 5

The riacs method suffers with large block sizes due to expensive
runtime overheads. The norep method performs better than the search
method as s increases, because it does not pay the
overhead of replication. But for large k, the time spent searching
increases, rendering norep inefficient.



11. Other Problems in Code Generation 

In this section, we provide an overview of other work from our group on 
several problems in code generation and runtime support for data-parallel 
languages. These include our work on communication generation, code gener- 
ation for complex subscripts, runtime data structures, support for operations 
on regular sections and array redistribution. 
Table 11.1. Table generation times (μs) for two-level mapping p = 32, a = 9

11.1 Communication Generation

In addition to problems in address generation, we have explored techniques 
for communication generation and optimization [40,42-44]. A compiler for 
languages such as HPF that generates node code (for each processor) has also 
to compute the sequence of sends and receives for a given processor to ac- 
cess non-local data. While the address generation problem has received much 
attention, issues in communication generation have received limited atten- 
tion; see [15] and [18] for examples. A novel approach for the management 
of communication sets and strategies for local storage of remote references 
is presented in [42,43]. In addition to algorithms for deriving communica- 
tion patterns [40,42,44], two schemes that extend the notion of a local array 
by providing storage for non-local elements (called overlap regions) inter- 
spersed throughout the storage for the local portion are presented [42,43]. 
The two schemes, namely course padding and column padding, enhance locality
of reference significantly at the cost of a small overhead due to unpacking
of messages. The performance of these schemes is compared to the traditional
buffer-based approach, and improvements of up to 30% in total time are
demonstrated. Several message optimizations such as offset communication, 
message aggregation and coalescing are also discussed. 

11.2 Union and Difference of Regular Sections 

Operations on regular sections are very common in code generation. The 
intersection operation on regular sections is easy (since regular sections are 
closed under intersection). Union and difference of regular sections are needed 
for efficient generation of communications sets; unfortunately, regular sections 
are not closed under union and difference operations. We present [9, 27]
efficient runtime techniques for supporting union and other operations
on regular sections. These deal with both the generation of the pattern
under these operations as well as with the efficient code that enumerates the 
resulting sets using the patterns. 
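
The closure properties mentioned above can be illustrated with a small brute-force sketch. It is not the runtime technique of [9, 27]; it simply enumerates two regular sections as Python sets and tests whether the result of each operation is again an arithmetic progression. The helper name `as_regular_section` and the two sections chosen are assumptions made for illustration.

```python
def as_regular_section(elems):
    """Return (lo, hi, stride) if elems form an arithmetic progression, else None."""
    xs = sorted(elems)
    if not xs:
        return None
    if len(xs) == 1:
        return (xs[0], xs[0], 1)
    stride = xs[1] - xs[0]
    if all(b - a == stride for a, b in zip(xs, xs[1:])):
        return (xs[0], xs[-1], stride)
    return None

a = set(range(0, 31, 2))   # regular section 0:30:2
b = set(range(0, 31, 3))   # regular section 0:30:3

inter = as_regular_section(a & b)   # again a regular section
union = as_regular_section(a | b)   # not a single regular section
diff = as_regular_section(a - b)    # not a single regular section
```

Here the intersection is the section 0:30:6, while the union and the difference cannot be described by a single lower bound, upper bound and stride.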



11.3 Code Generation for Complex Subscripts 

The techniques presented in this chapter assumed simple subscript functions. 
Array references with arbitrary affine subscripts can make the task of compil- 
ers for such languages highly involved. Work from our group [9,26,29,30,42] 
deals with the efficient address generation in programs with array references 
having two types of commonly encountered affine references, namely coupled 
subscripts and subscripts containing multiple induction variables (MIVs). 
These methods utilize the repetitive pattern of the memory accesses. In the 
case of MIV, we address this issue by presenting runtime techniques which 
enumerate the set of addresses in lexicographic order. Our approach to the 
problem incorporates a general approach of computing, in O(k) time, the
start element on a processor for a given global start element. Several meth- 
ods are proposed and evaluated here for generating the access sequences for 
MIV based on problem parameters. With coupled subscripts, we present two 
construction techniques, namely searching and hashing which minimize the 
time needed to construct the tables. Extensive experiments were conducted 
and the results were then compared with other approaches to demonstrate 
the efficiency of our approach. 



11.4 Data Structures for Runtime Efficiency 

In addition to algorithms for address sequence generation, we addressed the 
problem of how best to use the address sequences in [8,9]. Efficient techniques
for generating node code on distributed-memory machines are important. For
array sections, node code generation must exploit the repetitive access pattern
exhibited by the accesses to distributed arrays. Several techniques for
the efficient enumeration of the access pattern already exist. But only one 
paper [17] so far addresses the effect of the data structures used in represent- 
ing the access sequence on the execution time. In [8,9], we present several 
new data structures along with node code that is suitable for both DO loops 
and FORALL constructs. The methods, namely strip-mining and table com- 
pression facilitate the generation of time-efficient code for execution on each 
processor. While strip-mining views the problem as a double nested loop, ta- 
ble compression proves to be a worthwhile data structure for faster execution. 
The underlying theory behind the data structures introduced is explained, and
their effects over the full range of problem parameters are observed. Extensive
experimental results show the efficacy of our approach. The results compare
very favorably with the results of the earlier methods proposed by Kennedy 
et al. [16] and Chatterjee et al. [5]. 

11.5 Array Redistribution 

Array redistribution is used in languages such as High Performance Fortran to 
dynamically change the distribution of arrays across processors. Performing 
array redistribution incurs two overheads: (1) an indexing overhead for deter- 
mining the set of processors to communicate with and the array elements to 
be communicated, and (2) a communication overhead for performing the nec- 
essary irregular all-to-many personalized communication. We have presented 
efficient runtime methods for performing array redistribution [14,35]. In or- 
der to reduce the indexing overhead, precise closed forms for enumerating the 
processors to communicate with and the array elements to be communicated 
are developed for two special cases of array redistribution involving block- 
cyclically distributed arrays. The general array redistribution problem for 
block-cyclically distributed arrays can be expressed in terms of these special 
cases. Using the developed closed forms, a distributed algorithm for schedul- 
ing the irregular communication for redistribution is developed. The gener- 
ated schedule eliminates node contention and incurs the least communication 
overhead. The scheduling algorithm has an asymptotically lower scheduling 
overhead than techniques presented in the literature. Following this, we have 
developed efficient table-based runtime techniques (based on integer lattices) 
that incur negligible cost [9,28]. 
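
To make the indexing overhead concrete, the following sketch enumerates, by brute force, which elements each processor must send when an array is redistributed from CYCLIC(k_old) to CYCLIC(k_new). It assumes 0-based global indices, an identity alignment and the same number of processors before and after; the function names are hypothetical. The closed forms of [14,35] exist precisely to avoid this element-by-element enumeration.

```python
from collections import defaultdict

def owner(g, k, p):
    """Processor owning global index g under a CYCLIC(k) distribution on p processors."""
    return (g // k) % p

def send_sets(n, p, k_old, k_new):
    """Map (source, destination) -> list of global indices that must be communicated."""
    moves = defaultdict(list)
    for g in range(n):
        src, dst = owner(g, k_old, p), owner(g, k_new, p)
        if src != dst:          # elements that stay on the same processor need no message
            moves[(src, dst)].append(g)
    return dict(moves)

moves = send_sets(n=16, p=2, k_old=4, k_new=2)
```

For this small case, processor 0 sends elements 2, 3, 10 and 11 to processor 1, and processor 1 sends elements 4, 5, 12 and 13 to processor 0.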



12. Summary and Conclusions 

The success of data parallel languages such as High Performance Fortran 
and Fortran D critically depends on efficient compiler and runtime support. 
In this chapter we presented efficient compiler algorithms for generating lo- 
cal memory access patterns for the various processors (node code) given the 
alignment of arrays to a template and a CYCLIC(k) distribution of the template
onto the processors. Our solution to the one-level mapping problem is
based on viewing the access sequence as an integer lattice, and involves the
derivation of a suitable set of basis vectors for the lattice. The basis vector
determination algorithm is $O(\log \min(s, pk) + \min(s, k))$ and requires finding
$\min(s + 1, k)$ points in the lattice. Kennedy et al.'s algorithm for basis
determination is $O(\log \min(s, pk) + k)$ and requires finding 1 points in the
lattice. Our loop nest based technique used for address enumeration chooses
the best strategy as a function of the basis vectors, unlike [16]. Experimental
results comparing the times for our basis determination technique and that
of Kennedy et al. show that our solution is 2 to 9 times faster for large block
sizes. For the two-level mapping problem, we presented three new algorithms. 
Experimental comparisons with other techniques show that our solutions to 
the two-level mapping problem are significantly faster. In addition to these 
algorithms, we provided an overview of other work from our group on several 
problems such as 

— efficient basis vector generation using an $O(\log \min(s, pk))$ algorithm [25];

— communication generation [40,42-44]; 

— code generation for complex subscripts [9,26,29,30,42]; 

— effect of data structures for table lookup at runtime [8,9]; 

— runtime array redistribution [9,14,28,35]; and

— efficient support for union and other operations on regular sections [9,27]. 

Work is in progress on the problem of code generation and optimization for 
general affine access functions in whole programs. 
Summary. The cost of inter-processor communication is one of the major bottlenecks
of a distributed memory machine (DMM), which can be offset with efficient
algorithms for task partitioning and scheduling. Based on the data dependencies, 
the task partitioning algorithm partitions the application program into tasks and 
represents them in the form of a directed acyclic graph (DAG) or in compiler inter- 
mediate forms. The scheduling algorithm schedules the tasks onto individual pro- 
cessors of the DMM in an effort to lower the overall parallel time. It has been long 
proven that obtaining an optimal schedule for a generic DAG is an NP-hard problem.
This chapter presents a Scalable Task Duplication based Scheduling (STDS)
algorithm which can schedule the tasks of a DAG with a worst case complexity
of $O(|v|^2)$, where $v$ is the set of tasks of the DAG. The STDS algorithm generates an
optimal schedule for a certain class of DAGs which satisfy a Cost Relationship Condition
(CRC), provided the required number of processors is available. In case the
required number of processors is not available, the algorithm scales the schedule
down to the available number of processors. The performance of the scheduling algorithm
has been evaluated by its application to practical DAGs and by comparing
the parallel time of the schedule generated against the absolute or the theoretical 
lowerbound. 



1. Introduction 

Recently there has been an increase in the use of the distributed memory ma- 
chines (DMMs) due to the advances in the VLSI technology, inter-processor 
communication networks and routing algorithms. The scalability of DMMs 
gives them a major advantage over other types of systems. Some of the ap- 
plications which use DMMs are fluid flow, weather modeling, database sys- 
tems, image processing etc. The data for these applications can be distributed 
evenly onto the processors of the DMM and with fast access of local data, 
high speed-ups can be obtained. 

To obtain maximum benefits from DMMs, an efficient task partitioning 
and scheduling strategy is essential. A task partitioning algorithm partitions 
an application into tasks and represents it in the form of a directed acyclic 
graph (DAG). Once the application is transformed to a DAG, the tasks are 
scheduled onto the processors. This chapter introduces a compile time (static) 
scheduling technique with the assumption that a partitioning algorithm is 
available which transforms the application program to a DAG. 

Generating an optimal schedule for assigning tasks of a DAG onto DMMs
has been proven to be an NP-Complete problem [5, 18]. There are very few
special cases where the optimal schedule can be generated in polynomial
time [24]. The cases for which an optimal schedule can be obtained in polynomial
time are: (i) when unit execution time tasks represented in the form
of a tree are scheduled onto an arbitrary number of processors, or (ii) when unit
execution time tasks represented by an arbitrary task graph are scheduled onto
a two-processor architecture. Any relaxation of the above two cases makes
the problem NP-Complete.

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 649-682, 2001.
Springer-Verlag Berlin Heidelberg 2001

Sekhar Darbha and Dharma P. Agrawal

The sub-optimal solutions to the static scheduling problem can be ob- 
tained by heuristic methods which are based on certain assumptions that 
allow the algorithm to be executed in polynomial time bound. Even though 
the heuristics do not guarantee an optimal solution, they have been shown 
to perform reasonably well for many applications. 

The first set of algorithms are priority based algorithms. In these algo- 
rithms, each task of the DAG is assigned a priority and whenever a processor 
is available, the task with the highest priority among all the tasks which are 
ready to execute, is assigned to the free processor. A simple priority based 
algorithm is to assign a value of level (or co-level) to each node of the DAG [1] 
and assign a higher priority to the task which has higher level (or lower co- 
level). The level of any node is the length of the longest path from the node 
to an exit node and the co-level of any node is the length of the longest path 
from the node to an entry node. An entry node is a node which is not depen- 
dent on data generated by other tasks, i.e. the number of incoming edges at 
an entry node is zero and an exit node is a node which does not communicate 
its data to other nodes, i.e. the outgoing number of edges at an exit node 
is equal to zero. When computing the level and co-level, the communication 
costs are ignored and only the computation costs are taken into account. 
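
The level and co-level definitions above can be sketched in a few lines. The five-node DAG, its costs and the function names below are hypothetical and chosen only for illustration; a node's own computation cost is counted in its level, and communication costs are ignored as stated.

```python
from functools import lru_cache

# Hypothetical DAG: successor lists and computation costs are illustrative only.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
cost = {"a": 2, "b": 3, "c": 1, "d": 4, "e": 2}

# Invert the successor lists to obtain predecessor lists.
pred = {n: [] for n in succ}
for n, outs in succ.items():
    for m in outs:
        pred[m].append(n)

@lru_cache(maxsize=None)
def level(n):
    """Longest computation-cost path from n to an exit node (n included)."""
    return cost[n] + max((level(m) for m in succ[n]), default=0)

@lru_cache(maxsize=None)
def co_level(n):
    """Longest computation-cost path from n back to an entry node (n included)."""
    return cost[n] + max((co_level(m) for m in pred[n]), default=0)

# A higher level means a higher priority, so sort by decreasing level.
priority_order = sorted(succ, key=level, reverse=True)
```

A list scheduler would then repeatedly pick, among the ready tasks, the one that appears earliest in `priority_order`.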

The priority based algorithms are simple to implement. The problem is
that most of these schemes do not take inter-processor communication (IPC)
time into account. Even those algorithms which take IPC costs into account
suffer from the fact that they try to balance the workload rather than trying
to minimize the overall schedule length. Recently, many researchers have
proposed algorithms which have evolved from the priority based schemes
[15,19,25,27,31,32]. These heuristics attempt to provide reasonable results,
but an optimal solution is not guaranteed.

There are many scheduling algorithms which are based on clustering 
schemes [15,16,21,22,28-30,33]. These algorithms try to cluster tasks which 
communicate heavily onto the same processor. A description and comparison 
of some of these algorithms is given in [16]. Even clustering schemes do not 
guarantee optimal execution time. Also, if the number of processors available 
is less than the number of clusters, there could be a problem. 

There are several task duplication based scheduling algorithms. The Duplication
Scheduling Heuristic (DSH) [24] has a very impractical time complexity
of $O(|v|^4)$, where $|v|$ is the number of nodes of the DAG. The Search and
Duplication Based Scheduling (SDBS) algorithm [9] gives an optimal solution
with complexity of $O(|v|^2)$ if certain conditions are satisfied and if an adequate
number of processors is available. The problem with the SDBS algorithm is that
it duplicates tasks unnecessarily and requires a large number of processors.
Other algorithms using task duplication have been proposed in [2,6,7,26].

A critical path method and task duplication based scheduling algorithm
has been proposed in [8], but it depends on a very restrictive condition.
The authors stipulate that for every join node (defined as a node having more than
one incoming edge), the maximum value of the communication costs of the
incoming edges should be less than or equal to the minimum value of the
computation costs of the predecessor tasks. For example, for the join node d
shown in Figure 1.1, the computation cost of task a (which happens to be the
predecessor task of task d with the lowest computation cost) is the bottleneck
for communication costs. The cost of all the edges which are incident on
node d should be less than or equal to 3. This condition cannot be satisfied
by the join node of Figure 1.1. The condition introduced in this chapter is
more flexible and does not let a task with a lower computation cost become the
bottleneck for the edge costs.
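
The restrictive condition of [8] amounts to a single comparison per join node, sketched below. The function name and the cost values are hypothetical (the actual values of Figure 1.1 are not reproduced here); only the threshold of 3 for the cheapest predecessor is taken from the text.

```python
def join_node_condition(pred_costs, edge_costs):
    """Condition of [8] at one join node.

    pred_costs: computation costs of the predecessor tasks of the join node.
    edge_costs: communication costs of its incoming edges.
    """
    return max(edge_costs) <= min(pred_costs)

# Hypothetical join node: the cheapest predecessor costs 3, one incoming edge costs 5.
violated_example = join_node_condition(pred_costs=[3, 6, 8], edge_costs=[2, 5, 4])
# Same predecessors, but every incoming edge costs at most 3.
satisfied_example = join_node_condition(pred_costs=[3, 6, 8], edge_costs=[2, 3, 1])
```

A single cheap predecessor therefore caps the cost of every incoming edge, which is the bottleneck the paragraph above refers to.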




The basic strategy involved in most of the scheduling schemes is to group 
the tasks into clusters and assume that the available number of processors is 
greater than or equal to the number of clusters generated by the algorithm. A 
major limitation with most of these algorithms is that they do not provide an 
allocation which can be scaled down to the available number of processors in a 
gradual manner, if the available number of processors is less than the required 
number of processors for the initial clusters. Also, in case the available number 
of processors is higher than the initially required number of processors, the 
unused or the idle processors must be utilized to obtain a lower parallel time. 
In a DMM, once the resources have been assigned to a user, they remain under 
the user’s control until the program completes its execution. If some assigned 
resources are not used by a user, they remain unutilized. Thus, execution of 
duplicated tasks on unused processors would not pose any overhead in terms 
of resource requirements. 

This chapter introduces a Scalable Task Duplication based Scheduling 
(STDS) algorithm [10-12, 14] which is scalable in terms of the number of 
processors. The primary motivation of this work is to introduce an algorithm
which generates an optimal schedule for at least a certain class of DAGs. As 
mentioned earlier, the problem of obtaining optimal schedule for a generic 
DAG has been proven to be NP-Complete. The scheduling problem can be 
simplified by developing a suite of algorithms, each of which can generate 
an optimal schedule for a subset of DAGs. The STDS algorithm is the first 
algorithm of this set of algorithms and can generate an optimal schedule 
if the DAG satisfies the Cost Relationship Condition (CRC) (described in 
section 2). The concept of duplicating critical tasks is used in this algorithm 
which helps in getting a better schedule. 

The rest of the chapter is organized as follows. Section 2 gives a brief 
overview of the scheduling algorithm. The running trace of STDS algorithm 
on an example DAG is illustrated in section 3. The results obtained by the 
STDS algorithm are reported in section 4. Finally section 5 provides the 
conclusions. 



2. STDS Algorithm 

It is assumed that the task graph represented in the form of a DAG is available
as an input to the scheduling algorithm. The DAG is defined by the tuple
$(v, e, \tau, c)$, where $v$ is the set of task nodes and $e$ is the set of edges. The set $\tau$
consists of computation costs, and each task $i \in v$ has a computation cost
represented by $\tau(i)$. Similarly, $c$ is the set of communication costs, and each
edge from task $i$ to task $j$, $e_{i,j} \in e$, has a cost $c_{i,j}$ associated with it. In case two
communicating tasks are assigned to the same processor, the communication
cost between them is assumed to be zero. Without loss of generality, it can be
assumed that there is one entry node and one exit node. If there are multiple
entry or exit nodes, then the multiple nodes can always be connected through
a dummy node which has zero computation cost and zero communication cost
edges.
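
The dummy-node construction can be sketched as follows, assuming the DAG is held as two dictionaries: `tau` maps each task to its computation cost and `c` maps each edge (i, j) to its communication cost. This representation, the function name and the labels "ENTRY" and "EXIT" are assumptions made for illustration.

```python
def add_dummy_terminals(tau, c):
    """Return copies of (tau, c) that have a single entry node and a single exit node."""
    tau, c = dict(tau), dict(c)
    has_pred = {j for (_, j) in c}
    has_succ = {i for (i, _) in c}
    entries = [t for t in tau if t not in has_pred]
    exits = [t for t in tau if t not in has_succ]
    if len(entries) > 1:
        tau["ENTRY"] = 0                 # zero computation cost
        for t in entries:
            c[("ENTRY", t)] = 0          # zero communication cost
    if len(exits) > 1:
        tau["EXIT"] = 0
        for t in exits:
            c[(t, "EXIT")] = 0
    return tau, c

# Hypothetical DAG with two entry nodes (1, 2) and two exit nodes (3, 4).
tau = {1: 4, 2: 3, 3: 5, 4: 2}
c = {(1, 3): 2, (2, 3): 1, (2, 4): 3}
tau2, c2 = add_dummy_terminals(tau, c)
```

Because the added node and edges cost nothing, the schedule length of the original DAG is unchanged.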

A task is an indivisible unit of work and is non-preemptive. The under- 
lying target architecture is assumed to be homogeneous, connected and the 
communication costs between a pair of two processors for a fixed length of 
message is the same. It is assumed that an I/O co-processor is available and 
thus computation and communications can be performed simultaneously. 

The STDS algorithm generates the schedule based on certain parameters 
and the mathematical expressions to evaluate these parameters are provided 
below: 

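\begin{align}
pred(i) &= \{\, j \mid e_{j,i} \in e \,\} \tag{2.1} \\
succ(i) &= \{\, j \mid e_{i,j} \in e \,\} \tag{2.2} \\
est(i) &= 0, \quad \text{if } pred(i) = \emptyset \tag{2.3} \\
est(i) &= \min_{j \in pred(i)} \; \max_{k \in pred(i),\, k \ne j} \bigl( ect(j),\; ect(k) + c_{k,i} \bigr), \quad \text{if } pred(i) \ne \emptyset \tag{2.4} \\
ect(i) &= est(i) + \tau(i) \tag{2.5} \\
fpred(i) &= j \;\big|\; (ect(j) + c_{j,i}) \ge (ect(k) + c_{k,i}), \quad j \in pred(i);\ \forall k \in pred(i),\ k \ne j \tag{2.6} \\
lact(i) &= ect(i), \quad \text{if } succ(i) = \emptyset \tag{2.7} \\
lact(i) &= \min \Bigl( \min_{j \in succ(i),\, i \ne fpred(j)} \bigl( last(j) - c_{i,j} \bigr),\; \min_{j \in succ(i),\, i = fpred(j)} last(j) \Bigr) \tag{2.8} \\
last(i) &= lact(i) - \tau(i) \tag{2.9} \\
level(i) &= \tau(i), \quad \text{if } succ(i) = \emptyset \tag{2.10} \\
level(i) &= \max_{k \in succ(i)} level(k) + \tau(i), \quad \text{if } succ(i) \ne \emptyset \tag{2.11}
\end{align}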



The computation of the earliest start time (est) and earliest completion 
time (ect) follows in a top down fashion starting with the entry node and 
terminating at the exit node. The latest allowable start time (last) and latest 
allowable completion time (lact) are computed in a bottom-up fashion in
which the process starts from the exit node and terminates at the entry node. 
For each task i, a favorite predecessor fpred{i) is assigned using Eq. 2.6, which 
implies that a lower parallel time can be obtained by assigning a task and 
its favorite predecessor on the same processor. The STDS algorithm assigns 
a value of level (i) to each node i which is the length of the longest path from 
node i to an exit node. 
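
The two passes can be sketched in Python as follows. The sketch follows the formulation of Figure 2.2 (the favorite predecessor is the one that maximizes ect(j) + c(j,i), and est(i) is evaluated with that predecessor placed on the same processor). The computation costs are taken from Table 2.1 as ect minus est. The DAG structure is inferred from Table 2.1 and the accompanying text, and the edge costs are assumptions chosen to be consistent with that table, since the figure of the example DAG is not reproduced here. The function name is hypothetical.

```python
# Computation costs tau(i), derived from Table 2.1 (ect - est).
tau = {1: 5, 2: 3, 3: 7, 4: 4, 5: 5, 6: 10, 7: 4}
# Edge costs c[(i, j)]: ASSUMED values, chosen to be consistent with Table 2.1.
c = {(1, 2): 4, (1, 3): 4, (1, 4): 4,
     (2, 5): 4, (3, 5): 3, (3, 6): 4, (4, 6): 3,
     (5, 7): 2, (6, 7): 4}

def stds_parameters(tau, c):
    nodes = sorted(tau)          # node numbers here happen to be a topological order
    pred = {i: [j for (j, k) in c if k == i] for i in nodes}
    succ = {i: [k for (j, k) in c if j == i] for i in nodes}

    est, ect, fpred = {}, {}, {}
    for i in nodes:                                   # top-down pass
        if not pred[i]:
            est[i], fpred[i] = 0, None
        else:
            # favorite predecessor: largest ect(j) + c(j, i)
            k = max(pred[i], key=lambda j: ect[j] + c[(j, i)])
            others = [ect[j] + c[(j, i)] for j in pred[i] if j != k]
            est[i], fpred[i] = max([ect[k]] + others), k
        ect[i] = est[i] + tau[i]

    last, lact, level = {}, {}, {}
    for i in reversed(nodes):                         # bottom-up pass
        if not succ[i]:
            lact[i], level[i] = ect[i], tau[i]
        else:
            # no communication delay towards a successor whose favorite predecessor is i
            lact[i] = min(last[j] if fpred[j] == i else last[j] - c[(i, j)]
                          for j in succ[i])
            level[i] = max(level[j] for j in succ[i]) + tau[i]
        last[i] = lact[i] - tau[i]
    return est, ect, fpred, last, lact, level

est, ect, fpred, last, lact, level = stds_parameters(tau, c)
```

With these inputs the sketch reproduces the values listed in Table 2.1, and changing `tau[6]` from 10 to 7 reproduces Table 2.2.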

This algorithm will yield optimal results if the CRC given below is satis- 
fied by all the join nodes of the DAG. The CRC guarantees optimality and 
is not a prerequisite for the algorithm to execute. The CRC needs to be true 
only for join nodes. A join node is a node of the DAG where the number of 
predecessor tasks is greater than one. The CRC for join node i is: 

Cost-Relationship Condition: Let $m$ and $n$ be the predecessor tasks of task
$i$ that have the highest and second highest values of
$\{(ect(j) + c_{j,i}) \mid j \in pred(i)\}$, respectively. Then one of the following must be satisfied:

- $\tau(m) \ge c_{n,i}$ if $est(m) \ge est(n)$, or

- $\tau(m) \ge (c_{n,i} + est(n) - est(m))$ if $est(m) < est(n)$.

The STDS algorithm assigns independent tasks to different processors if 
adequate processors are available. If a join node satisfies the condition, it im- 
plies that optimal schedule for a join node is obtained if only one predecessor 
task of the join node is allocated to the same processor as the join node. If 
the schedule time can be lowered by allocating multiple predecessors of the 
join node to the same processor as the join node, then this condition cannot 
be satisfied. 
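
A check of the CRC at a single node might look as follows. The sketch reads the condition as: with m and n the predecessors having the highest and second highest ect(j) + c(j,i), require tau(m) >= c(n,i) when est(m) >= est(n), and tau(m) >= c(n,i) + est(n) - est(m) otherwise. The function name, the dictionary layout and the numeric values are assumptions made for illustration.

```python
def satisfies_crc(i, pred, est, ect, tau, c):
    """Check the Cost Relationship Condition at node i (trivially true for non-join nodes)."""
    if len(pred[i]) < 2:
        return True
    ranked = sorted(pred[i], key=lambda j: ect[j] + c[(j, i)], reverse=True)
    m, n = ranked[0], ranked[1]
    if est[m] >= est[n]:
        return tau[m] >= c[(n, i)]
    return tau[m] >= c[(n, i)] + est[n] - est[m]

# Hypothetical join node 5 with predecessors 2 and 3.
pred = {5: [2, 3]}
est = {2: 5, 3: 5}
ect = {2: 8, 3: 12}
tau = {2: 3, 3: 7}

ok = satisfies_crc(5, pred, est, ect, tau, {(2, 5): 4, (3, 5): 3})    # m = 3, n = 2
bad = satisfies_crc(5, pred, est, ect, tau, {(2, 5): 9, (3, 5): 4})   # m = 2, n = 3
```

In the second call the short task 2 becomes m, and its computation cost of 3 cannot cover the cost 4 of the other incoming edge, so the condition fails.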

The CRC stipulates that the DAG be of coarse granularity. The condition 
is satisfied if the granularity of the DAG as defined in [17] is greater than 
or equal to 1.0. The computation and communication costs can cause the 
granularity of the DAG to be less than 1.0 and still satisfy the condition. Some 
example DAGs which have low granularity and which satisfy the condition
are considered in section 4. 

The pseudocode in Figure 2.1 gives an overview of the steps involved in
this algorithm. The first two steps of the algorithm compute the est, ect,
fpred, last, lact and level for all the nodes of the DAG. The code for steps
1 and 2 is shown in Figure 2.2. The latest allowable start and completion
times can be used to evaluate how critical two directly connected
tasks are to each other. For example, if task j is a successor of task i and the
condition $(last(j) - lact(i)) \ge c_{i,j}$ is satisfied, then task i is
not critical for the execution of task j, and it is not necessary to execute both
i and j on the same processor to yield the lowest possible schedule time. For



Input:
    Tasks (Nodes) 1....N
    Edges 1....M
    Task Computation Costs: τ(1)....τ(N)
    Edge Communication Costs: c_{i,j}
    Available number of processors in the system: AP
Output: A schedule which can run on the available number of processors.
Begin:
    1. Compute est(i), ect(i), fpred(i) for all nodes i ∈ v.
    2. Compute last(i), lact(i), level(i) for all nodes i ∈ v.
    3. Assign tasks to processors in linear cluster fashion; the number of
       clusters generated = RP. (Refer to Figure 2.5)
    4. If AP > RP, scale up the schedule to obtain a better parallel time;
       else if AP < RP, reduce the number of processors. (Refer to Figure 2.7)



Fig. 2.1. Overall Code for the STDS Algorithm 



example, Figure 2.3 shows an example DAG and its corresponding schedule
(as generated by STDS algorithm). The est, ect, last, lact, level and fpred 
for this example DAG are shown in Table 2.1. 

In this DAG, tasks 2 and 3 are predecessors of task 5. Since $ect(3) + c_{3,5}$
is greater than $ect(2) + c_{2,5}$, task 3 is the favorite predecessor for task 5.
Ideally, tasks 3 and 5 should be assigned to the same processor to obtain a 
lower completion time for task 5. When the last and lact are computed in the 
second step, it is observed that task 5 can be delayed by 3 time units without 
altering the overall schedule time. Thus, it is not necessary to assign tasks 3 
and 5 to the same processor. Without the knowledge of the latest start times 
and the latest completion times, task 3 would have been duplicated onto the 
Input: DAG(v, e, τ, c)
    pred(i): Set of predecessor tasks for task i.
    succ(i): Set of successor tasks for task i.
Output: For each task i ∈ v:
    Earliest Start Time est(i)
    Earliest Completion Time ect(i)
    Latest Allowable Start Time last(i)
    Latest Allowable Completion Time lact(i)
    Favorite Predecessor fpred(i)
    Level level(i)
Begin:
    1. Compute est(i), ect(i) and fpred(i) for all nodes i ∈ v.
       a) For any task i, if pred(i) = ∅, then est(i) = 0.
       b) Let T = max{ect(j) + c_{j,i} | j ∈ pred(i)} be obtained by node k. Then,
          est(i) = max{{ect(j) + c_{j,i} | j ∈ pred(i), j ≠ k}, ect(k)}
       c) fpred(i) = k
       d) ect(i) = est(i) + τ(i)
    2. Compute last(i) and lact(i) for all nodes i ∈ v.
       a) For any task i, if succ(i) = ∅, then lact(i) = ect(i) and level(i) = τ(i).
       b) For j ∈ succ(i):
          i.   If i is fpred(j) then temp(j) = last(j), else temp(j) = last(j) − c_{i,j}
          ii.  lact(i) = minimum{temp(j) | j ∈ succ(i)}
          iii. last(i) = lact(i) − τ(i)
          iv.  level(i) = maximum{level(j) | j ∈ succ(i)} + τ(i)



Fig. 2.2. Code for Steps 1 and 2 of the STDS Algorithm 



Table 2.1. Start and Completion Times for Example DAG

Node   est   ect   fpred   last   lact   level
1      0     5     -       0      5      26
2      5     8     1       8      11     12
3      5     12    1       5      12     21
4      5     9     1       5      9      18
5      12    17    3       15     20     9
6      12    22    3       12     22     14
7      22    26    6       22     26     4

two processors and the algorithm would have required four processors instead
of three.

On the contrary, if $\tau(6)$ is modified from 10 to 7 (as shown in Figure 2.4),
the algorithm will generate the schedule shown in Figure 2.4. For this modified
DAG, the start and completion times would be obtained as shown in
Table 2.2. In this case, the start times of none of the tasks can be delayed
and four processors are required to obtain a lower schedule time.
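
As a worked check of the criticality test $(last(j) - lact(i)) \ge c_{i,j}$ for the edge from task 3 to task 5, the left-hand side can be read directly from the two tables. The edge cost $c_{3,5}$ appears only in the figure of the example DAG, which is not reproduced here, so it is left symbolic:

```latex
\begin{align*}
\text{Table 2.1: } & last(5) - lact(3) = 15 - 12 = 3 \\
\text{Table 2.2: } & last(5) - lact(3) = 12 - 12 = 0
\end{align*}
```

In the first case task 5 has 3 time units of slack, which matches the observation that it can be delayed by 3 time units. In the second case there is no slack, so for any positive $c_{3,5}$ the test fails and task 3 remains critical for task 5.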



Table 2.2. Start and Completion Times for Example DAG

Node   est   ect   fpred   last   lact   level
1      0     5     -       0      5      23
2      5     8     1       5      8      12
3      5     12    1       5      12     18
4      5     9     1       5      9      15
5      12    17    3       12     17     9
6      12    19    3       12     19     11
7      19    23    6       19     23     4




Step three, shown in Figure 2.5, generates the initial tasks clusters and 
is based on the parameters computed in steps one and two and on the array 
queue. The elements in the array queue are the nodes of the task graph 
sorted in smallest level first order. Each cluster is intended to be assigned to 
a different processor and the generation of a cluster is initiated from the first
task in the array queue which has not yet been assigned to a processor. The 
generation of the cluster is completed by performing a search similar to the 
depth first search starting from the initial task. The search is performed by 
tracing the path from the initial task selected from queue to the entry node by 
following the favorite predecessors along the way. If the favorite predecessor 
is unassigned, i.e. not yet assigned to a processor, then it is selected. In case 
the favorite predecessor has already been assigned to another processor, it 
is still duplicated if there are no other predecessors of the current task or if 
all the other predecessors of the current task have been assigned to another 
processor. For example, in the DAG shown in Figure 2.6, task 5 is the only 
predecessor of tasks 6, 7 and 8 and thus task 5 is duplicated on all the three 
processors. In case the favorite predecessor is already assigned to another 
processor and there are other predecessors of the current task which have 
not yet been assigned to a processor, then the other predecessors which have 
not been assigned to a processor are examined to determine if they could 
initially have been the favorite predecessor. This could have happened if, for
another task $k$ ($k \in pred(i)$), $(ect(k) + c_{k,i}) = (ect(j) + c_{j,i})$, where $i$ is the
current task and $j$ is its favorite predecessor. If there exists such a task k, the
path to the entry node is traced by traversing through the task k. If none of 
the other predecessors could initially have been the favorite predecessor, then 
Input: DAG(v, e, τ, c)
    queue: Set of all tasks stored in ascending order of level.
Output: Task Clusters
Begin
    RP = 0
    x = first element of queue
    Assign x to P_RP
    while (not all tasks are assigned to a processor) {
        y = fpred(x)
        if (y has already been assigned to another processor) {
            k = another predecessor of x which has not yet been assigned to a
                processor.
            if ((last(x) − lact(y)) >= c_{y,x}) then /* y is not critical for x. */
                y = k
            else
                found = 0;
                for another predecessor z of x, z ≠ y
                    if ((ect(y) + c_{y,x}) = (ect(z) + c_{z,x}) && task z has not yet been
                        assigned to any processor) then { y = z; found = 1; }
                endif
                if (found = 0) { y = k; modproc[i] = RP; modtask[i] = x; }
            endif
        }
        endif
        Assign y to P_RP
        x = y
        if x is entry node
            assign x to P_RP
            x = the next element in queue which has not yet been assigned to a
                processor
            RP++;
            Assign x to P_RP
        endif
    }



Fig. 2.5. Code for Step 3 of the STDS Algorithm 






Duplication Based Compile Time Scheduling Method for Task Parallelism 659 

the process of cluster generation can be continued by following through any 
other unassigned predecessor of task i. This process helps reduce the number 
of tasks which are duplicated. The generation of cluster terminates once the 
path reaches the entry node. The next cluster starts from the first unassigned 
task in queue. If all the tasks are assigned to a processor, then the algorithm 
terminates. In this step, the algorithm also keeps track of all the tasks which 
did not make use of the favorite predecessor to complete the task allocation. 





After the initial clusters are generated, the algorithm checks whether the
number of processors required by the schedule (RP) is less than, equal to,
or greater than the number of available processors (AP). If RP = AP, then
the algorithm terminates. If RP > AP, the processor reduction procedure is
executed, and if RP < AP, the processor incrementing procedure is executed.
The code for reducing and increasing the number of processors is shown in
Figure 2.7.

In the reduction step, each processor i is initially assigned a value
exec(i), which is the sum of the execution costs of all the tasks on that
processor. For example, if tasks 1, 5 and 9 are assigned to processor i, then
exec(i) would be equal to $(\tau(1) + \tau(5) + \tau(9))$. After computing exec(i) for all
the processors, the algorithm sorts the processors in ascending order of
exec(i) and merges the task lists of processors. In the first pass the task lists
are merged to obtain half the initial number of processors. The task lists
of the processors with the highest and lowest values of exec(i) are merged,
the task lists of the processors with the second highest and second lowest
values of exec(i) are merged, and so on. In case the required number of
processors is odd, the processor with the highest exec(i) remains unchanged
and the tasks of the rest of the processors are merged. If the number of
required processors is still higher than the number of available processors,
multiple passes through this procedure might be required, and each pass can
reduce the number of processors to as little as half of its value at the start
of the pass. When merging the tasks, the tasks on the new processor are
executed in highest level first order.



Input: Processor Allocation List
       Available Number of Processors: AP
       Required Number of Processors: RP
       modtask: Array of tasks where favorite predecessor was not used
       modproc: Array of processors where corresponding modtask is allocated.
Output: Modified Processor Allocation
Begin:
  if (AP > RP)   /* Begin Processor Incrementing Procedure */
  {
    for (i = 0 to (AP - RP) - 1)
    {
      go to allocation of processor given by modproc[i]
      copy allocation from modtask[i] downwards to a new processor
      traverse from modtask[i] back to entry node following favorite
        predecessors along the way.
    }
  }
  else   /* Begin Processor Reduction Procedure */
  {
    while (AP < RP)
    {
      Calculate exec(i) for each processor i, i.e. sum of execution costs of
        all tasks allocated to processor i
      Sort processors in ascending order of exec(i)
      temp = RP - AP
      if (temp > RP/2) temp = RP/2
      Merge task lists of processors j and (temp*2 - j - 1) for j = 0 to temp - 1
      decrement RP
    }
  }

Fig. 2.7. Code for Step 4 of the STDS Algorithm

[Figure: first pass, processors P1 to P11 in increasing order of exec(i), from
least busy to most busy; second pass, processors S3, S4, S1, S2, S6, S5 in
increasing order of exec(i), from least busy to most busy.]

Fig. 2.8. Example of Reducing Schedule From 11 Processors to 5 Processors



An example of merging task lists to reduce the schedule from 11 processors
to 5 processors is shown in Figure 2.8. The task lists are merged until the
number of remaining processors is equal to the number of available
processors AP. In the first phase, the task lists of 5 sets of 2 processors each
are merged to reduce the processor count to 6. The values of exec(i) for the
new processors are computed again. The exec(i) of a new processor will not
necessarily be the sum of the exec(i)'s of the earlier processors, because a
task which has been duplicated on both processors need not be executed
twice on the merged processor.

Suppose the ascending order of exec(i) for the processors in the second phase
is S3, S4, S1, S2, S6, S5. In the second phase, the task lists of processors S3 and
S4 are merged to reduce the processor count by one, to five processors.
The final processor allocation has processors S1, S2, S7, S5 and S6. The task
list of each of these processors corresponds to the initial allocation as follows:

S1: Tasks of Processors P1 and P10

S2: Tasks of Processors P2 and P9

S7: Tasks of Processors P3, P4, P7 and P8

S5: Tasks of Processors P5 and P6

S6: Tasks of Processor P11
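
The reduction procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `reduce_processors`, the representation of a task list as a set, and the sample costs are all hypothetical, and the highest-level-first ordering of tasks on a merged processor is not modelled.

```python
def reduce_processors(task_lists, cost, available):
    """Merge task lists until at most `available` (>= 1) processors remain.

    task_lists: list of sets of task ids, one set per processor
    cost:       dict mapping task id -> execution cost
    """
    def exec_cost(tasks):
        # A duplicated task appears once in a set, so it is counted once.
        return sum(cost[t] for t in tasks)

    procs = [set(tasks) for tasks in task_lists]
    while len(procs) > available:
        procs.sort(key=exec_cost)                # least busy first
        rp = len(procs)
        temp = min(rp - available, rp // 2)      # pairs merged in this pass
        merged = [procs[j] | procs[2 * temp - j - 1] for j in range(temp)]
        procs = merged + procs[2 * temp:]        # busiest ones stay untouched
    return procs


# Hypothetical data: processor Pk holds the single task k, whose cost is k.
cost = {k: k for k in range(1, 12)}
first_pass = reduce_processors([{k} for k in range(1, 12)], cost, available=6)
result = reduce_processors([{k} for k in range(1, 12)], cost, available=5)
```

With these sample costs the first pass pairs P1 with P10, P2 with P9, and so on, and leaves P11 alone, which is the pairing described in the text. The second pass depends on the recomputed exec(i) values and therefore need not match the S3/S4 merge of Figure 2.8.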

Finally, in the incrementing step, the schedule is scaled to use the surplus 
number of processors. In this step, the algorithm traverses through the task 
graph and increments the number of processors. For each extra processor 
that is available, the algorithm goes to the task on the processor that did not 
use its favorite predecessor when allocating originally. The new allocation 
uses the favorite predecessors for all the tasks while traversing back to the 
entry node. The original tasks list is copied to a new processor. For example, 
in Figure 2.9, task i and task k (favorite predecessor for task i) were not 
allocated to the same processor initially. When a new processor P3 becomes 
available, the traversal list from task i to the entry node on processor P2 is 
copied to the new processor and on processor P2, the traversal from i to the 
entry node is performed using the favorite predecessors. 



Initial Schedule 




Fig. 2.9. Example of the Incrementing part of Step 4

2.1 Complexity Analysis

The first two steps compute the start times and the completion times by
traversing all the tasks and the edges of the task graph. In the worst case,
the algorithm would have to examine all the edges of the DAG. Thus, the
worst case complexity of these steps is $O(|e|)$, where $|e|$ is the number of
edges.

Since the third step traverses the task graph in a manner similar to
depth first search, the complexity of this step would be the same as the
complexity of a general search algorithm, which is $O(|v| + |e|)$ [3], where
$|v|$ is the number of nodes and $|e|$ is the number of edges. Since the task
graph is connected, the complexity is $O(|e|)$.

In step 4, the complexity depends upon the difference between the
required number of processors for the linear clusters and the number of
processors provided by the system. For each pass through the reduction code
shown in Figure 2.7, the time required is of the order of $O(|v| + |P| \log |P|)$,
where $|v|$ is the number of tasks and $|P|$ is the number of processors. Each
pass of the reduction code reduces the number of processors to half the
original number of processors. Thus, the number of passes through the
reduction code is given by:

Number of passes through Step 4 of Algorithm, $Q = \lceil \log_2 (RP/AP) \rceil$   (2.12)

The complexity of the reduction step would be $O(Q(|v| + |P| \log |P|))$. In
the worst case, RP would be equal to $|v|$ and AP would be equal to one,
and Q would be $\log_2 |v|$. Also, in the worst case $|P|$ would be equal to
$|v|$. The complexity of this step in the worst case would be
$O(|v| \log_2 |v| + |v| (\log_2 |v|)^2)$. For all values of $|v|$ the condition
$\log_2 |v| < |v|$ would be true, and for $|v| \ge 16$ the condition
$(\log_2 |v|)^2 \le |v|$ would be satisfied. For practical applications, $|v|$
would always be larger than 16 and consequently, the worst case complexity
of this step would be $O(|v|^2)$.
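
As a quick check of Eq. 2.12, the short Python sketch below computes the number of reduction passes for the processor counts used in this chapter's examples. The function name `reduction_passes` is hypothetical, and returning 0 when no reduction is needed is an assumption added for completeness.

```python
import math


def reduction_passes(rp, ap):
    """Q = ceil(log2(RP / AP)); 0 when RP <= AP (no reduction needed)."""
    if rp <= ap:
        return 0
    return math.ceil(math.log2(rp / ap))


print(reduction_passes(11, 5))  # two passes, as in the Figure 2.8 example
print(reduction_passes(5, 4))   # a single pass suffices
```

Going from 11 to 5 processors gives Q = 2, which agrees with the two phases of the Figure 2.8 example.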

For the incrementing step, in the worst case, for each processor, all the
nodes of the graph might have to be examined. Thus, the worst case
complexity of this step is $O(|P||v|)$, where $|P|$ is the number of available
processors. If the extra number of processors available, i.e. (AP - RP), is
equal to the number of times the favorite predecessor was not used, then an
optimal schedule can be obtained by using the favorite predecessor for all
those processors.

The overall time complexity of the algorithm is in $O(|e| + |v|^2)$ or
$O(|e| + |P||v|)$, depending upon whether the number of available processors
is less than or greater than the initially required number of processors. For a
dense graph the number of edges, $|e|$, is proportional to $|v|^2$. Also, $|P|$
would be much smaller than $|v|$. Thus, the worst case complexity of the
algorithm in both the cases is $O(|v|^2)$.





3. Illustration of the STDS Algorithm 



The working of the STDS algorithm is illustrated by a simple example DAG 
shown in Figure 3.1. The steps involved in scheduling the example DAG are 
explained below: 




Step 1: Find est, ect and fpred for all nodes $i \in v$. The est of the
entry node 1 is zero, since $pred(1) = \emptyset$. Node 1 completes at time 7.
The est, ect and fpred of other nodes can be computed using Eq. 2.3-2.6.
For example, node 8 has nodes 3 and 7 as predecessors. Thus,
$est(8) = \min\{\max\{(ect(3) + c_{3,8}), ect(7)\}, \max\{ect(3), (ect(7) + c_{7,8})\}\} = 22$.
Since $(ect(7) + c_{7,8}) > (ect(3) + c_{3,8})$, node 7 is the fpred of node 8.
The est, ect and fpred of all the nodes of the DAG are shown in
Table 3.1.

Step 2: Find last, lact and level for all nodes $i \in v$. The lact of the exit
node 20 is equal to its ect. The last of node 20 is equal to
$lact(20) - \tau(20) = 49$. The last, lact and level of other nodes can be computed
using Eq. 2.7-2.11. For example, node 12 has nodes 13, 14 and 15 as
successors. Node 12 is the fpred for all three nodes and consequently,
$lact(12) = \min\{last(13), last(14), last(15)\} = 34$. The value of last(12)
is $lact(12) - \tau(12) = 29$. The last, lact and level of all the nodes of the
DAG are shown in Table 3.1.
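
The est(8) computation in Step 1 can be reproduced with a small Python sketch. It generalises the two-predecessor expression shown above to any number of predecessors; whether this matches Eq. 2.3-2.6 term for term is an assumption, the function name `est_and_fpred` is hypothetical, and the edge costs 5 and 3 are made-up values (the actual costs of Figure 3.1 are not reproduced here), chosen only to be consistent with est(8) = 22.

```python
def est_and_fpred(preds):
    """preds maps predecessor id -> (ect, communication cost to the join node)."""
    # The favorite predecessor maximises ect(j) + c(j, i).
    fpred = max(preds, key=lambda j: preds[j][0] + preds[j][1])
    candidates = []
    for j, (ect_j, _) in preds.items():
        # Put j on the same processor as i: its message costs nothing,
        # while every other predecessor still has to send its data.
        others = [ect_k + c_k for k, (ect_k, c_k) in preds.items() if k != j]
        candidates.append(max([ect_j] + others))
    return min(candidates), fpred


# ect(3) = 14 and ect(7) = 22 come from Table 3.1; the costs are hypothetical.
print(est_and_fpred({3: (14, 5), 7: (22, 3)}))  # (22, 7)
```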

Table 3.1. Start and Completion Times for Nodes of DAG

Node   level   est   ect   last   lact   fpred
1      50      0     7     0      7      -
2      43      7     16    7      16     1
3      35      7     14    11     18     1
4      26      7     12    18     23     1
5      16      7     10    28     31     1
6      5       7     8     41     42     1
7      34      16    22    16     22     2
8      28      22    29    22     29     7
9      21      22    27    26     31     7
10     13      22    25    33     36
11     4       22    23    43     44     7
12     21      29    34    29     34     8
13     16      34    39    34     39     12
14     10      34    37    38     41     12
15     3       34    35    45     46     12
16     11      39    43    39     43     13
17     7       43    46    43     46     16
18     2       43    44    47     48     16
19     4       46    49    46     49     17
20     1       49    50    49     50     19




Step 3: Generate clusters similar to linear clusters. For this DAG, the array
queue would be as follows:

queue = {20, 18, 15, 19, 11, 6, 17, 14, 16, 10, 13, 5, 12, 9, 4, 8, 7, 3, 2, 1}

The generation of linear clusters is initiated with the exit node, i.e. node
20. While searching backwards, the path is traced through the favorite
predecessors of each task. Since this is the first pass, none of the tasks
have been assigned to any processor. Thus, the allocation list of {20,
19, 17, 16, 13, 12, 8, 7, 2, 1} is obtained for processor 1. The next pass
through the search procedure is started from the first unassigned task in
the array queue, which is task 18. The search from task 18 yields task 16
as its favorite predecessor. Since task 16 is already allocated to another
processor, the search proceeds with task 15. Since $last(18) - lact(16) >
c_{16,18}$, task 16 is not critical for task 18. Similarly, for tasks 15 and 11,
tasks 12 and 7 are their favorite predecessors respectively. For similar
reasons, tasks 12 and 7 are not critical for the execution of tasks 15 and
11, and thus even if additional processors are available, tasks 12 and 7
need not be duplicated onto the same processor as tasks 15 and 11. Thus,
processor 2 has an allocation list of {1, 6, 11, 15, 18}. The next search pass
starts with task 14. Task 14 has task 12 as the favorite predecessor and it
is critical because $last(14) - lact(12) < c_{12,14}$. Since task 12 has already
been allocated to another processor, this fact is noted, and if an additional
processor is available, task 12 can be duplicated onto the same processor as
task 14. Following this procedure, the task clusters shown in Figure 3.2
are obtained. In this allocation there are two places where the favorite
predecessor was not used for generating the clusters. In addition to the
case of task 14 mentioned above, task 9 allocated to processor 4 did not
make use of its favorite predecessor, task 7, in the initial allocation.
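
The criticality test used twice in Step 3 compares the slack between a task and its favorite predecessor with the cost of the edge joining them. A minimal Python sketch follows; the function name `is_critical` is hypothetical, the last and lact values are taken from Table 3.1, and the edge costs 3 and 6 are made-up values that merely respect the inequalities stated in the text.

```python
def is_critical(last_x, lact_y, c_yx):
    """True if predecessor y must share a processor with x to avoid delaying x."""
    # y is not critical when the slack last(x) - lact(y) covers the message cost.
    return (last_x - lact_y) < c_yx


# Task 18 and its favorite predecessor 16: slack 47 - 43 = 4, assumed cost 3.
print(is_critical(47, 43, 3))   # False: task 16 need not be duplicated
# Task 14 and its favorite predecessor 12: slack 38 - 34 = 4, assumed cost 6.
print(is_critical(38, 34, 6))   # True: task 14 is recorded in modtask
```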



[Figure: task clusters on processors P1 to P5]

Fig. 3.2. Initial Processor Schedule



Step 4a: The number of required processors is 5. If the available number of
processors is less than 5, then the task lists of different processors need to
be merged. For example, suppose the number of processors available is 2.
Then the number of processors needs to be reduced by 3. The procedure
to reduce the number of processors is as follows:

1. Find the values of exec(i) for each processor i. The values of exec(i)
for processors one to five are 50, 11, 16, 17, 14 respectively.

2. Since the processor count is being reduced by more than half, two passes
through this procedure would be needed. In the first pass the task
lists of processors (P2, P4) and processors (P3, P5) would be merged.
This would give us an allocation for 3 processors S1, S2 and S3.
Processor S1 has the same tasks as processor P1, processor S2 has
the tasks of processors P2 and P4 combined and processor S3 has
the tasks of processors P3 and P5 combined. If 3 processors are
available, the reduction procedure would be terminated at this step.
After getting the allocation for three processors, the exec(i) for the
three processors have to be computed and sorted based on the value
of exec(i). The exec(i) of processors S1, S2 and S3 are 50, 21 and 23
respectively. In the second pass, the task lists of processors S2 and S3
would be merged to obtain the schedule for the two processors. If
four processors are available, only one pass through this procedure
would be required: the tasks of processors P2 and P5 would be merged
and the rest of the processors would remain the same. The modified
allocations for the two, three and four processor cases are shown in
Figure 3.3.
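
The pairings quoted in Step 4a follow directly from the merge rule of Figure 2.7 applied to the exec(i) values listed above. The Python sketch below checks this; the function name `merge_pairs` and the dictionary representation are hypothetical, and only the selection of pairs for a single pass is modelled, not the merging of the task lists themselves.

```python
def merge_pairs(exec_costs, available):
    """Return the processor pairs merged in one reduction pass.

    exec_costs: dict mapping processor name -> exec(i)
    """
    order = sorted(exec_costs, key=exec_costs.get)   # ascending exec(i)
    rp = len(order)
    temp = min(rp - available, rp // 2)
    return [(order[j], order[2 * temp - j - 1]) for j in range(temp)]


initial = {"P1": 50, "P2": 11, "P3": 16, "P4": 17, "P5": 14}
print(merge_pairs(initial, 2))   # [('P2', 'P4'), ('P5', 'P3')]
print(merge_pairs(initial, 4))   # [('P2', 'P5')]
print(merge_pairs({"S1": 50, "S2": 21, "S3": 23}, 2))   # [('S2', 'S3')]
```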

Step 4b: The initial number of required processors is five. Suppose ten
processors are available for the execution of this application. In this DAG,
there are two places where a favorite predecessor was not used when
assigning tasks to processors. For task 9 allocated to processor 4, task 7
is the favorite predecessor, and for task 14 allocated to processor 3, task
12 is the favorite predecessor, but the initial clusters were not generated
that way. If one extra processor is available, the allocation on processor 4
from task 9 onwards is copied to the new processor, and on the original
processor the path from task 9 to the entry node is retraced following the
favorite predecessors along the way. If another free processor is available,
the allocation on processor 3 from task 14 onwards is copied to the new
processor, and the path from task 14 to the entry node is retraced through
the favorite predecessors. If more than seven processors are available, the
extra ones remain unused, as there are no more tasks which are not using
their favorite predecessors. The resulting allocations for the six and seven
processor cases are shown in Figure 3.4.

For the example task graph, Table 3.2 shows the schedule length that
would be achieved as the number of available processors is varied from 1
upwards. Even if there are more than 7 processors, the algorithm utilizes only
7 processors.

[Figure: schedules for the four, three and two processor cases]

Fig. 3.3. Final Processor Schedule When Number of Processors Less than
Five

Table 3.2. Number of Processors vs. Schedule Length

Number of Processors   Schedule Length
1                      80
2                      52
3                      52
4                      52
5                      52
6                      51
7 and above            50

4. Performance of the STDS Algorithm

The performance of the STDS algorithm is observed for five cases. The first
case is when the CRC is satisfied and the required number of processors is
available. This case reduces to the case of the TDS algorithm [13], where an
optimal schedule is guaranteed; the proof is given in section 4.1. Next, the
case where either the CRC is not necessarily satisfied or the required number
of processors is not available is considered. For this case, extensive
simulations have been performed. Random data for edge and node costs have
been obtained, and the schedule length generated by the STDS algorithm for
this data has been compared to the absolute lowerbound (described in section
4.2). In the third case, the STDS algorithm has been applied to practical
applications which have larger DAGs. In the fourth case, the algorithm has
been applied to a special class of DAGs, namely the diamond DAGs. Finally,
the STDS algorithm has been compared to other existing algorithms using a
small example DAG.

4.1 CRC Is Satisfied 

A DAG consists of fork and join nodes. The fork nodes can be transformed 
with the help of task duplication to achieve the earliest possible schedule time 
as shown in Figure 4.1. The problem arises when scheduling the join nodes 
because only one predecessor can be assigned to the same processor as the 
join node. In this subsection, it is proven that for join nodes which satisfy the 
CRC, the schedule time obtained by scheduling the join node on the same 
processor as its favorite predecessor, is optimal. The rest of the predecessors 
of the join node are each assigned to a separate processor. 

Fig. 4.1. Schedule of Fork Node

Theorem 4.1. Given a join node satisfying the CRC stated in section 2, the
STDS algorithm gives the minimum possible schedule time.

Proof. Figure 4.2 shows an example join node. According to the CRC, tasks
m and n have the highest and the second highest values of
$\{ect(j) + c_{j,i} \mid j \in pred(i)\}$. It is assumed that task m is assigned to
processor m and task n is assigned to processor n. Since task m has the
highest value of $ect(m) + c_{m,i}$, task i is also assigned to processor m and
$est(i) = \max\{ect(m), ect(n) + c_{n,i}\}$.

It will be proven that the start time of task i cannot be lowered by
assigning tasks m and n to the same processor if the CRC is satisfied. Thus,
tasks m and n have to be assigned to different processors. The other
predecessors may have any values of computation and communication times,
but task i has to wait until $ect(m)$ or $ect(n) + c_{n,i}$, whichever is higher.
Thus, the other tasks will not affect $est(i)$ as long as the condition
$ect(k) + c_{k,i} \le ect(n) + c_{n,i}$ is satisfied for all $k \in pred(i)$,
$k \ne m, n$. Thus, only tasks m and n need to be considered among all the
predecessors of task i.

There are two possible cases here.

Case 1: $est(m) \ge est(n)$. From the CRC stated in section 2,
$\tau(m) \ge c_{n,i}$ has to be satisfied. Here again there could be two cases:

Case 1(a): $ect(m) \ge ect(n) + c_{n,i}$, i.e. $est(i) = ect(m)$.
If tasks m, n and i are assigned to the same processor,
$est(i) = \max(est(m), (est(n) + \tau(n))) + \tau(m) = \max(ect(m), ect(n) + \tau(m))$.
Thus, the start time of task i cannot be reduced below its current
starting time of $ect(m)$ by assigning m, n and i to the same processor.

Case 1(b): $ect(m) < ect(n) + c_{n,i}$, i.e. $est(i) = ect(n) + c_{n,i}$.
If tasks m, n and i are assigned to the same processor, the earliest task
i can start is given by $(est(n) + \tau(n) + \tau(m))$, i.e. $ect(n) + \tau(m)$. Since
$\tau(m) \ge c_{n,i}$, $ect(n) + \tau(m)$ would be greater than or equal to
$ect(n) + c_{n,i}$. Thus $est(i)$ cannot be lowered.

Case 2: $est(m) < est(n)$. From the CRC stated in section 2,
$\tau(m) \ge c_{n,i} + est(n) - est(m)$ has to be satisfied.

Case 2(a): $ect(m) \ge ect(n) + c_{n,i}$, i.e. $est(i) = ect(m)$.
If tasks m, n and i are assigned to the same processor,
$est(i) = est(m) + \tau(m) + \tau(n) = ect(m) + \tau(n)$. Thus, the start time of
task i cannot be lower than its current starting time of $ect(m)$.

Case 2(b): $ect(m) < ect(n) + c_{n,i}$, i.e. $est(i) = ect(n) + c_{n,i}$.
If m, n and i are assigned to the same processor, the earliest start time of
task i would be $est(m) + \tau(m) + \tau(n)$. The start time of task i can be
improved if $est(m) + \tau(m) + \tau(n) < est(n) + \tau(n) + c_{n,i}$. In other words,
if $est(m) + \tau(m) < est(n) + c_{n,i}$, or if $\tau(m) < (est(n) - est(m)) + c_{n,i}$.
But it is known that $\tau(m) \ge (est(n) - est(m)) + c_{n,i}$. Thus the start
time of task i cannot be lowered. $\Box$

This proves that if the CRC given in section 2 is satisfied by all the join
nodes of the DAG, then the STDS algorithm yields the earliest possible start
time, and consequently the earliest possible completion time, for all the
tasks of the DAG.

4.2 Application of Algorithm for Random Data 

In the earlier section, it has been proven that if the CRC is satisfied and 
if the required number of processors are available, then optimal schedule is 
obtained. Here, the idea is to observe the performance of this algorithm on 
random edge and node costs that do not necessarily satisfy the CRC. To 
observe the performance, the example DAG shown in Figure 4.3 has been 
taken and the edge and node costs have been generated using a random 
number generator. For each of these sets of data and for different number of 
processors, the ratio of the STDS generated schedule length and the absolute 
lowerbound have been computed. The absolute lowerbound for an application 
on P processors is defined as follows: 

Absolute Lowerbound = $\max\left\{\text{Level of entry node},\ \frac{1}{P}\sum_{i \in v} \tau(i)\right\}$   (4.1)

The first term of the equation, i.e. the level of the entry node of the DAG, is 
the sum of computation costs along the longest path from entry node to the 
exit node. For example, for the example DAG in Figure 4.3, there are several 
paths from the entry node 1 to the exit node 10. The longest path is given by 
1-3-6-8-10, yielding a level of 20 time units. Since the dependencies in this 
linear path have to be maintained, the schedule length can never be lower 
than the level of the entry node. The second term of the above equation is the 
overall sum of all the computation costs divided by the number of processors. 
The schedule length has to be greater than or equal to the second term. Thus, 
the maximum of the two terms will be the theoretical lowerbound, which may 
or may not be practically achievable. 
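
Eq. 4.1 is simple enough to evaluate directly. The Python sketch below does so for the 20-node example DAG of Section 3 rather than for the DAG of Figure 4.3, whose costs are not reproduced here; the task costs are derived from Table 3.1 as ect(i) minus est(i), the entry level 50 is the level of node 1 in that table, and the function name `absolute_lowerbound` is hypothetical.

```python
def absolute_lowerbound(entry_level, task_costs, processors):
    """Eq. 4.1: the larger of the critical-path length and the average load."""
    return max(entry_level, sum(task_costs) / processors)


# tau(i) = ect(i) - est(i) for nodes 1..20 of Table 3.1; the costs sum to 80.
tau = [7, 9, 7, 5, 3, 1, 6, 7, 5, 3, 1, 5, 5, 3, 1, 4, 3, 1, 3, 1]

print(absolute_lowerbound(50, tau, 1))   # 80.0
print(absolute_lowerbound(50, tau, 2))   # 50
```

For two processors the bound is 50, while Table 3.2 reports an STDS schedule length of 52, a ratio of 1.04.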

The random number generator has been used 1000 times to generate 1000
sets of data. For each of these data sets, the costs were taken as (modulus
100 + 1) of the number generated. In effect, the edge and node costs lie in
the range of 1 to 100. The random number generator generated DAGs with
differing granularities, as defined in [17]. Using this definition of granularity,
the percentage of graphs versus the granularity values for these data sets has
been plotted in Figure 4.4.

Fig. 4.3. Example Directed Acyclic Graph

Fig. 4.5. Ratio of Schedule Lengths Vs. Number of Processors

For the randomly generated data sets, the ratio of algorithm generated 
schedule and the absolute lowerbound has been obtained as the number of 
processors is varied from 1 to 5 and is shown in Figure 4.5. All these DAGs 
required less than or equal to 5 processors for scheduling. On the average, 
the STDS algorithm required 26% more time than absolute lowerbound to 
schedule these DAGs. 

4.3 Application of Algorithm to Practical DAGs 

The STDS algorithm has been applied on different practical DAGs. The 
four applications are Bellman-Ford algorithm [4,23], systolic algorithm [20], 
master-slave algorithm [23] and Cholesky decomposition algorithm [16]. The 
first three applications are part of the ALPES project [23]. These four appli- 
cations have varying characteristics. The number of nodes of these DAGs is 
around 3000 but the number of edges varies from 5000 to 25000. This gives 
the ranges of sparse to dense graphs. The number of predecessors and suc- 
cessors varies from 1 to 140. The computation and communication costs vary 
in a wide range. The schedule time is generated by the STDS algorithm for
different numbers of processors and is compared against the absolute
lowerbound. These are shown in Figures 4.6-4.9. These figures have four parts and
the description of these plots is as follows:

— In the (a) plots, the variation of schedule length is shown as the number of 
processors decreases. It can be observed that the schedule length varies in 
a step-wise fashion. The explanation for this is that the algorithm initially 
merges the task lists of the idle processors. But there comes a stage where 
it is necessary to merge the task lists with the most busy processor. At 
that point, a major jump in the schedule length can be noticed. After that 
point, the schedule length again increases very slowly till the next critical 
point is reached. 

— In the (b) plots, the variation of absolute lowerbound with the number of 
processors has been plotted. 

— In the (c) plots, the interesting portions of the (a) and (b) plots for each 
of the applications has been zoomed. In the plot, the step-wise increase in 
schedule length generated by STDS algorithm is more noticeable. 

— Finally, in (d), the ratio of STDS generated schedule to the absolute lower- 
bound as the number of processors is varied is plotted. It can be observed 
from the plots that the ratios fluctuate very heavily when the number of 
processors is lower. As the number of processors increases the ratios remain 
constant. The reason for this is that as the number of processors is low- 
ered, the schedule length increases rapidly, i.e. the number of steps visible 
is large. Since the absolute lowerbound varies smoothly, the ratio between 
the STDS schedule and the absolute lowerbound is dictated by the STDS 
schedule. 

The characteristics of each application are shown in Table 4.1.

4.4 Scheduling of Diamond DAGs 

In this section, the performance of STDS algorithm on a special class of 
DAGs, namely the diamond DAGs is observed. The general structure of the 
diamond DAGs is shown in Figure 4.10. These DAGs are similar to master- 
slave DAGs in which the master gives instructions to the slaves and the slaves 
send the result back to the master node. 

If these DAGs satisfy the cost relationship condition, then they provide 
an optimal schedule using n processors, where n is the width of the DAG or 
the number of slave tasks at each level. Figure 4.11 shows a special case of 
diamond DAG where n is 3 and it satisfies the CRC . The schedule generated 
by STDS algorithm for the DAG shown in Figure 4.11 is shown in Figure 4.12. 
The schedule length obtained by STDS algorithm for this DAG is optimal. 
In this schedule, it can be observed that only the critical nodes or the master 
nodes have been duplicated on the three processors.

[Figures 4.6 to 4.9 each have four panels: (a) STDS Schedule Length,
(b) Absolute Lowerbound, (c) Schedule Lengths, (d) Ratio of Schedule Lengths.]

Fig. 4.6. Plots for Bellman-Ford Shortest Path Algorithm

Fig. 4.7. Plots for Cholesky Decomposition Algorithm

Fig. 4.8. Plots for Systolic Algorithm

Fig. 4.9. Plots for Master-Slave Algorithm

Table 4.1. Performance Parameters for Different Applications

Characteristics                      Bellman-Ford   Cholesky        Systolic    Master-Slave
                                     Algorithm      Decomposition   Algorithm   Algorithm
Granularity of DAG                   0.002          0.013           0.002       0.002
Condition Satisfied                  Yes            Yes             No          No
Initial Processors Required          203            75              50          50
  By STDS
Maximum Processors Required          1171           342             97          50
  By STDS
Average Ratio of STDS/Absolute       1.095          1.118           1.315       1.50
  Lowerbound
Ratio of STDS/Absolute Lowerbound    1.0004         1.0             1.0017      1.0019
  when maximum processors
  are available

Fig. 4.10. General Structure of Diamond DAGs

Fig. 4.13. DAG for Comparison of Algorithms

4.5 Comparison with Other Algorithms

Finally, STDS algorithm has been compared to five other algorithms using 
the example DAG used in [16]. The example DAG, shown in Figure 4.13, 
is used for comparing the STDS algorithm with DSC algorithm [16], Linear 
Clustering algorithm [22], Internalization pre-pass [30], MCP clustering al- 
gorithm [33] and the Threshold scheduling scheme [28]. The schedule time 
generated by each of the algorithms and the algorithm complexities are shown 
in Table 4.2. It can be seen that STDS has the optimal schedule and has the 
lowest complexity among all the scheduling algorithms. 



Table 4.2. Comparison of Algorithms

Algorithms                       Schedule length   Complexity
Optimal Algorithm                8.5               NP-Complete
Linear Clustering                11.5              $O(v(e + v))$
MCP Algorithm                    10.5              $O(v^2 \log v)$
Internalization Pre-Pass         10.0              $O(e(v + e))$
Dominant Sequence Clustering     9.0               $O((e + v) \log v)$
Threshold Scheduling Algorithm   10.0              $O(e)$
STDS Algorithm                   8.5               $O(e)$



5. Conclusions 

A Scalable Task Duplication based Scheduling algorithm for DMMs has been 
presented in this chapter which operates in two phases. Initially linear clusters 
are generated and if the number of required processors for the linear clusters 
is more than the number of processors provided by the system, then the task 
lists are merged to reduce the required number of processors. On the contrary, 
if number of processors available is more than required number of processors 
then the surplus processors are used in an effort to reduce the schedule length 
and try to bring it as close to optimal as possible. 

The complexity of this algorithm is of the order of $O(|v|^2)$, where $|v|$ is
the number of nodes or tasks in the DAG. The results as obtained by its
application to the randomly generated DAGs are very promising. Also, the 
STDS algorithm has been applied to large application DAGs. It has been 
observed that the schedule length generated by STDS algorithm varies from 
1.1 to 1.5 times the absolute lowerbound. It has been observed that the 
STDS algorithm performs very well in reducing the number of processors by 
increasing the schedule length gradually. 

of Dynamic Data Structures 

1. Introduction 

A wide range of applications make use of regular dynamic data structures. 
Dynamic data structures are typically required because either the size of the 
data structure cannot be determined at compile-time or the form of the data 
structure depends upon the data that it contains, which is not known until 
runtime. Some commonly used regular data structures include link lists, trees, 
and two dimensional meshes. Lists and trees are used by applications such as 
the N-body problem [10] and sparse Cholesky factorization [8,11]; dynamic 
meshes are used for solving partial differential equations; and quad-trees are 
used by applications such as solid modeling, geographic information systems, 
and robotics [9]. 

Recently researchers have focussed on the translation of programs written 
using the shared-memory paradigm for parallel SPMD (single-program, 
multiple-data) execution on distributed-memory machines [1,6]. In this 
approach the distribution of shared data arrays among the processors is specified 
as a mapping between array indices and processor ids. This mapping is ei- 
ther specified by the user or generated automatically by the compiler. The 
compiler translates the program into an SPMD program using the owner- 
computes rule in which parallelism is exploited by having all processors simul- 
taneously operate on portions of shared data structures that reside in their 
respective local memories. The mapping between array indices and processor 
ids is also used by the compiler to generate interprocessor communication 
necessary for SPMD execution. 

While the above approach is applicable to programs that use shared data 
arrays, it cannot be directly applied to programs with shared pointer-based 
dynamic data structures. This is due to the manner in which dynamic data 
structures are constructed and accessed. Unlike shared data arrays that are 
created and distributed among the processors at compile-time, dynamic data 
structures must be created and distributed among the processors at runtime. 
A mechanism for specifying the distribution of nodes in a dynamic data 
structure is also needed. Unlike arrays whose elements have indices, the nodes 
of a dynamic data structure have no unique names that could be used to 
specify a mapping of nodes in a dynamic data structure to processors. 

On a shared-memory machine, pointers through which shared data structures 
are accessed are implemented as memory addresses. Clearly this 
implementation of a pointer is not applicable to a distributed-memory machine. 
Even if a global pointer representation were available, an additional problem 
exists. Once a dynamic data structure is distributed among the processors, 
the pointers that link the nodes are also scattered across the processors. 
Each processor must traverse these links to access the nodes in the data 
structure, so the links would have to be broadcast to all processors, creating 
excessive interprocessor communication. An efficient mechanism for 
traversing and accessing a dynamic data structure is therefore required. 

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 683-708, 2001. 

In this chapter an approach for distributing and accessing dynamic data 
structures on distributed-memory machines is presented that addresses the 
problems described above. Language and compiler support for SPMD execu- 
tion of programs with dynamic data structures is developed. Extensions to a 
C-like language are developed which allow the user to specify dynamic name 
generation and data distribution strategies. The dynamic name generation 
strategy assigns a name to each node added to a dynamic data structure at 
runtime. A dynamic distribution strategy is a mapping between names and 
processor ids that is used to distribute a dynamic data structure as it is cre- 
ated at runtime. The compilation strategy used in this approach allows the 
traversal of a data structure without generating interprocessor communica- 
tion. The name assigned to a node is a function of the position of the node in 
the data structure. Each processor can generate the names of the nodes inde- 
pendently and traverse the data structure through name generation without 
requiring interprocessor communication. 

The subsequent sections first present the language constructs for declaring 
regular distributed data structures and specifying dynamic name generation 
and distribution strategies. The semantics of the pointer operations used 
to construct and manipulate local and distributed dynamic data structures 
are described. Compilation techniques for translating programs with dynamic 
data structures into SPMD programs are presented next. Extensions for 
supporting irregular dynamic data structures are then developed. Some 
compile-time optimizations are also briefly discussed. A discussion of related 
work concludes this chapter. 



2. Language Support for Regular Data Structures 

In this section language constructs required for expressing processor struc- 
tures, declaring local and distributed dynamic data structures, and specifying 
dynamic name generation and distribution strategies are presented. The se- 
mantics of pointer related operations used for constructing and manipulating 
dynamic data structures are also discussed. 

2.1 Processor Structures 

To express the processing requirements the user is provided with processor 
declaration statements shown below. In each of these declarations the user 
specifies the number of processors required. The selection of the declaration 
(TREE, ARRAY or MESH) determines the topology in which the processors 
are to be organized. Neighboring processors in the given topologies are con- 
sidered to interact frequently. Thus, it is preferable that these processors are 
directly linked. The advantage of specifying topologies is that the subset of 
processors that are allocated to the program can be selected to match the 
communication patterns expected to occur frequently in the program. Good 
embeddings of the available topologies into the physical topology of the sys- 
tem, for example a hypercube, can be predetermined. Routing techniques 
for different topologies can be implemented once and used at runtime by all 
applications. In the declarations below, in case of a TREE the degree of each 
node and the total number of processors is specified by the user and in the 
cases of ARRAY and MESH structures the number of processors along each 
dimension are specified. 

processor TREE ( degree; num ); 
processor ARRAY ( num ); 
processor MESH ( num_1; num_2 ); 

Each processor in a processor structure is also assigned a logical id, denoted 
as Π, which contains one or more fields (π_1, π_2, ..., π_m). The number of fields 
chosen should be appropriate for uniquely identifying each processor in a 
given processor topology. The fields in the processor id have positive non-zero 
values. The processor id Π_0 (= (0,0,...)) will never be assigned to any processor 
and therefore it is used to indicate a null processor id. The mapping between 
the names (Ψ) of data items and processor ids (Π) will determine the manner 
in which the data structure is distributed among the processors. We assume 
that processor ids of the ARRAY and MESH topologies are represented by 
a single integer and a pair of integers respectively. The processor id of the 
TREE topology consists of two integers, the first identifies a level in the tree 
topology and the second identifies the specific processor at that level. 

2.2 Dynamic Data Structures 

Dynamic data structures are declared using pointer variables as in commonly 
used languages such as C. The user can declare a base type which describes 
the nodes used in building a data structure. From this base type the user can 
derive new types which represent simple local and distributed data structures. 
In addition, complex hierarchical data structures may also be specified. Each 
level in the hierarchy of such a data structure is a simple local or distributed 
data structure. By declaring variables of the above types the user can then 
write programs that construct and manipulate dynamic data structures. 




686 Rajiv Gupta 



The nodes of a distributed data structure are spread among the processors 
using a distribution scheme specified by the user. All nodes belonging to 
a local data structure either reside on a single processor or are replicated 
on all processors. If a local data structure is not a part of any distributed 
hierarchical data structure, then it is replicated on all processors. However, 
if the local data structure is a substructure contained within a distributed 
level of a hierarchical data structure, then all nodes in the local substructure 
reside on the single processor at which the parent node containing the local 
substructure resides. 

In this work SPMD execution is based upon the owner-computes rule. 
Thus, a statement that operates on a non-replicated data structure is exe- 
cuted by a single processor. On the other hand statements that operate on 
replicated local data structures are executed by all processors. As a result 
any modification made to a replicated local data structure is made by each 
processor to its local copy. Therefore at any given point in the program the 
copies of a local data structure as seen by different processors are identical. 

The type of a pointer variable is used by the compiler to determine how 
the data structure associated with the pointer variable is to be implemented. 
In addition to declaring pointer variables representing local and distributed 
dynamic data structures, the user may also declare temporary pointer vari- 
ables of the base type to assist in the traversal of dynamic data structures. 

There is one major restriction that is imposed on distributed data 
structures. A node may belong to only a single distributed data structure at 
a given time. This restriction allows the processor at which the node resides 
to be uniquely determined. If a node belonging to a local or distributed data 
structure is to be added to another distributed data structure, the user must 
explicitly copy the node from the former to the latter. In the remainder of 
this section the type and variable declarations and the semantics of 
operations involving pointer variables are described in greater detail. 

The type declaration of a pointer type is augmented with an attribute 
which specifies whether the data structure built using this type is to be 
distributed across the processors or it is simply a local data structure. 

type NodeType is { distributed | local } BaseType; 

Doubly linked data structures are viewed as being derived from singly 
linked structures created by associating with a unidirectional link, a link 
in the reverse direction. The creation of a link from one node to the next 
automatically creates the reverse link. In a language like C a doubly-linked 
data structure is implemented by introducing two distinct links that must 
be explicitly manipulated by the user. This is because C does not provide a 
language construct for expressing reverse links. Given a link ptr, the presence 
of a reverse link revptr is expressed using the reverse construct shown below. 
Reverse pointers, such as revptr, can only be used for traversing a data 
structure. The creation of new nodes must be carried out using the original 
pointer ptr. 

*pointertype ptr; 
reverse (ptr,revptr); 

A pointer variable in a program may be of a base type, a local data 
structure type, or a distributed data structure type. These variables are used 
to create new nodes, traverse existing data structures, access data stored in a 
data structure, and manipulate data structures through pointer assignments. 
The pointer variables are classified into three general categories: handles, 
constructors, and temporaries. The semantics of pointer related operations is 
based upon the category to which a pointer variable belongs. 

A handle is a pointer variable which is of the type representing a local or 
distributed data structure. If a node is created using a handle it is assumed 
that this is the first node in the data structure. If the handle points to a local 
data structure, then a node is created at each processor so that a replicated 
structure can be built. However, if the local data structure is a substructure of 
a distributed data structure, then the node is created at the processor where 
the containing node of the distributed structure resides. If the handle points to 
a distributed structure, the distribution scheme will enable the identification 
of the processor at which the node is to be located. On the other hand if the 
current data structure is a substructure of another distributed data structure, 
then the processor at which the node resides will be inherited from the parent 
node. Since a handle is essentially a global variable that provides access to a 
distributed data structure, it is replicated on all processors. 

The pointer variables that are fields of a node belonging to a local or 
distributed data structure are called constructors. These pointer variables are 
used to connect together the nodes belonging to the data structure. When an 
additional node is added to a data structure using a constructor, it is known 
that the data structure is being expanded. In the case of distributed data 
structures this information enables us to ascertain the processor at which the 
new node is to be located. 

Temporary pointer variables assist in the traversal and construction of a 
pointer-based data structure. A temporary is declared to be of a base type. If 
a temporary points to a node in a local or distributed data structure it can be 
used to traverse the data structure. The same temporary can be used to tra- 
verse a local data structure in one part of the program and a distributed data 
structure in another part of the program. It can be used to create and initial- 
ize single nodes which can be added to a local or distributed data structure. 
After the node has been added to a data structure the temporary pointer 
variable can be used to traverse the data structure. The type of a temporary 
pointer varies from simply being a base type to a local or distributed data 
structure type derived from the base type. Since a temporary may be used 
by all processors to traverse a distributed data structure, it is replicated on 
all processors. A node created using a temporary can be added to the data 
structure. 

The above discussion was primarily limited to a single data structure. 
Constructors and handles also play an important role in programs involving 
multiple data structures. In particular, they can be used to split or merge 
distributed data structures. For example, given a distributed link list, let us 
consider a constructor pointing to an element of the link list. If this con- 
structor is assigned to another handle of a distributed link list type, then the 
original list is split at the position of the constructor and now forms a list 
corresponding to the handle to which the constructor is assigned. Thus, the 
above process results in splitting of a distributed link list into two distributed 
lists. The merging of distributed lists can be achieved by assigning the value 
of the handle corresponding to one list to the constructor within the last list 
element of the other list. 

2.3 Name Generation and Distribution Strategies 

Each element of a pointer-based data structure is assigned a name. A name, 
denoted as Ψ, contains one or more fields (ψ_1, ψ_2, ..., ψ_n). The manner 
in which these names are to be derived is specified by the user; hence the 
number of fields appropriate for describing the data structure is also chosen 
by the user. The name generation strategy must specify the name of the first 
node in the data structure. In addition, given a node n with name Ψ, the 
computation of the names of the nodes directly connected to n is also 
specified. In other words, the computation of the name of each node n→ptr_i, 
denoted as Ψ.ptr_i, is specified. If the link represented by ptr_i is a 
bidirectional one, then the computation of Ψ.revptr_i from Ψ is also specified. 
The construct for specifying the naming strategy for a distributed data 
structure DistType is shown below. 

name DistType { 
    first: ( Ψ_0 ); 
    Ψ.ptr_1: f_1(Ψ); 
    Ψ.revptr_1: f_1^{-1}(Ψ); 
    ... 
    Ψ.ptr_n: f_n(Ψ); 
    Ψ.revptr_n: f_n^{-1}(Ψ); 
} 

The mapping of a distributed data structure type onto a given processor 
topology is also specified by the user. The construct for specifying the 
distribution strategy of a distributed type DistType on a processor topology 
ProcStruc is shown below. 

distribute DistType on ProcStruc { 
    Π = ( π_1 = g_1(Ψ), π_2 = g_2(Ψ), ..., π_m = g_m(Ψ) ) 
} 

2.4 Examples 

The constructs described in the preceding sections are illustrated through 
three examples: a tree, a two dimensional mesh, and a list of link lists (an 
example of a hierarchical data structure). These examples demonstrate that 
distribution strategies similar to cyclic and block-cyclic distributions used for 
static arrays can be expressed for dynamic data structures using the name 
generation and distribution constructs described in the preceding section. 

First let us consider the specification of a distributed tree shown below. 
A general name generation and distribution strategy applicable to trees of 
an arbitrary degree n is presented. 

type Node is distributed 
    struct { 
        int data; 
        *Node ptr_1, ptr_2, ..., ptr_n 
    }; 
var *Node atree; 

The naming strategy that is presented next assigns a single integer as a 
name to each node in the tree data structure. The root of the tree is assigned 
the name one, the successive nodes at a given level are assigned consecutive 
numbers as their names, and lower numbers are assigned to lower levels in 
the tree. Fig. 2.1 shows the names of the nodes. 




Fig. 2.1. Naming strategy for the distributed tree data structure. 



The name (Ψ) assigned to a node is simply the position of the node in 
the distributed tree data structure. The name of any other node P→ptr_i is 
computed from the name of node P. Implementation of a doubly-linked tree 
structure requires an additional function for computing the position of node 
P→revptr_i from the position of node P. The function which computes the 
name for node P→revptr_i from the name of node P is the inverse of the 
function that computes the name of node P→ptr_i from the name of node 
P. 

name Node { 
    first: 1; 
    Ψ.ptr_i: n (Ψ - 1) + i + 1; 
    Ψ.revptr_i: (Ψ - i - 1) / n + 1 
} 
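
The naming rule can be checked with a few lines of Python. This is a minimal sketch, not part of the language described here: the function names child_name and parent_name are hypothetical, the root is assumed to carry the name 1 as stated in the text, and the parent rule is written with integer division so that it inverts the child rule for every link index i.

```python
def child_name(psi: int, i: int, n: int) -> int:
    """Name of the i-th child (1 <= i <= n) of the node named psi."""
    if psi < 1 or not 1 <= i <= n:
        raise ValueError("need psi >= 1 and 1 <= i <= n")
    return n * (psi - 1) + i + 1


def parent_name(psi: int, n: int) -> int:
    """Name of the parent of node psi; inverts child_name for every i."""
    if psi <= 1:
        raise ValueError("the root (name 1) has no parent")
    return (psi - 2) // n + 1


# Any processor can generate the names along a path on its own:
# follow the second link twice from the root of a binary tree.
path = [1]
for _ in range(2):
    path.append(child_name(path[-1], 2, 2))
print(path)  # [1, 3, 7]
```

Because both functions depend only on the name and the degree n, no processor needs to consult another one to walk up or down the tree.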

In order to describe the distribution of a tree on a processor structure, 
both the processor structure and the dynamic data structure are viewed 
as consisting of series of levels. Successive levels of the data structure are 
mapped to successive levels of the processor structure. The root nodes of the 
data structure are created on the first level of the processor structure. If the 
number of levels in the data structure is greater than the number of levels in 
the processor structure, multiple levels of the data structure will be mapped 
to a single level of the processor structure. The children of a node assigned 
to a processor at the last level in the processor structure are assigned to 
processors at the first level of the processor structure. The nodes in the data 
structure at a given level are distributed equally among the processors in 
the corresponding level in the processor structure. In this scheme a processor 
id is represented as Π = (p,l) and it refers to the p-th processor at level l. 
The function ProcId given below returns the processor at which a node with 
position Ψ should be located. 

distribute Node on ProcStruc { Π = ProcId(Ψ) } 
function ProcId( Ψ ) { 
    /* function returns processor id Π to which node Ψ is mapped */ 
    /* NumLevels(ProcStruc) - number of levels in ProcStruc */ 
    /* NumProcs(l) - number of processors at level l of ProcStruc */ 
    currentΨ = currentlevel = 0; 
    while (true) { 
        currentlevel++; 
        l = (currentlevel - 1) mod NumLevels(ProcStruc) + 1; 
        for j = 1 .. n^(currentlevel-1) { 
            p = (j-1) mod NumProcs(l) + 1; 
            currentΨ++; 
            if (Ψ == currentΨ) { return ( p, l ) } 
        } 
    } 
} 
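
The level-by-level mapping described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the chapter's implementation: the name proc_id and the parameter procs_per_level (the number of processors at each level of the processor tree) are hypothetical, and the inner loop assumes that level k of a degree-n tree holds n^(k-1) nodes.

```python
from typing import List, Tuple


def proc_id(psi: int, n: int, procs_per_level: List[int]) -> Tuple[int, int]:
    """Return (p, l): the p-th processor at level l that owns node psi."""
    if psi < 1:
        raise ValueError("node names start at 1")
    num_levels = len(procs_per_level)
    current_psi = 0
    current_level = 0
    while True:
        current_level += 1
        # Data levels wrap around the processor levels cyclically.
        l = (current_level - 1) % num_levels + 1
        # Assumed: level k of the data tree contains n**(k-1) nodes.
        for j in range(1, n ** (current_level - 1) + 1):
            p = (j - 1) % procs_per_level[l - 1] + 1
            current_psi += 1
            if current_psi == psi:
                return (p, l)


# Binary tree of data mapped onto a seven-processor binary tree (1 + 2 + 4).
for name in (1, 3, 7, 8, 17):
    print(name, proc_id(name, 2, [1, 2, 4]))
```

With these inputs the fourth data level wraps back to the single processor at level 1, which is what makes each subtree rooted at that level land on a full copy of the processor tree.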

A desirable property of the above strategy is that it tends to divide a 
data structure into pieces which are identical to the processor structure. For 
example, if a binary tree data structure is mapped to a fixed size binary 
tree of processors, this strategy divides up the data structure into binary 
trees of the size of the processor structure. This is illustrated in Fig. 2.2 by 
the distribution of a binary tree data structure on a binary tree processor 
structure with seven processors. 







Consider the specification of the mapping of a data structure representing 
a two dimensional mesh to an n × n processor mesh. Simple mapping functions 
which assign a unique name to each node and distribute the data structure 
among the processors are given below. Both the names and processor ids are 
specified as integer pairs. The name of a node is the same irrespective of the 
parent node through which it is computed and hence it is unique. 

constant n = 2; 
processor MESH ( 1:n, 1:n ); 
type 
    Node is struct { int data; *Node left, right }; 
    DistMesh is distributed Node; 
name DistMesh { 
    first: (ψ_1 = 0, ψ_2 = 0); 
    Ψ.left: (ψ_1 + 1, ψ_2); 
    Ψ.right: (ψ_1, ψ_2 + 1); 
} 
distribute DistMesh on MESH { 
    Π = ( π_1 = ψ_1 mod n, π_2 = ψ_2 mod n ) 
} 
var *DistMesh amesh; 
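
The claim that a mesh node receives the same name regardless of the parent through which it is reached can be illustrated with a short Python sketch. The helper names left, right and owner are hypothetical; the formulas are copied from the declarations above, so the processor coordinates come out 0-based exactly as the mod n expressions produce them.

```python
from typing import Tuple

Name = Tuple[int, int]
FIRST: Name = (0, 0)  # name of the first mesh node


def left(psi: Name) -> Name:
    """Name of the node reached through the left link."""
    return (psi[0] + 1, psi[1])


def right(psi: Name) -> Name:
    """Name of the node reached through the right link."""
    return (psi[0], psi[1] + 1)


def owner(psi: Name, n: int) -> Tuple[int, int]:
    """Cyclic distribution of names over an n x n processor mesh."""
    return (psi[0] % n, psi[1] % n)


# The same node is reached along two different paths.
via_left_first = right(left(FIRST))
via_right_first = left(right(FIRST))
print(via_left_first == via_right_first, owner(via_left_first, 2))
```

Since owner is a pure function of the name, nodes whose coordinates differ by a multiple of n share a processor, which is the cyclic pattern the text describes.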

Fig. 2.3 illustrates the division of a mesh among the processors in MESH 
(1:2, 1:2). As one can see, like the distribution strategy for the tree, this 
distribution strategy also divides the mesh data structure into smaller pieces 
corresponding to the mesh processor structure. 







Fig. 2.3. Distributed mesh data structure. 

The final example is that of a hierarchical data structure. The data struc- 
ture represents a list of link lists. Lists at two levels in the hierarchy are 
distributed along the two dimensions of the MESH processor structure. The 
lists at the second level of the data structure are doubly linked. The naming 
strategy used by the two types of lists is the same and consists of consecutive 
numbers. The distribution of the top level list assigns consecutive pairs of 
list elements to consecutive processors on the first column of processors. The 
distribution of a list contained within an element of the top level list is car- 
ried out in pairs along the processors belonging to a row in the MESH. The 
processor row number to which such a list is assigned is inherited from the 
containing element of the top level list. This is indicated by the assignment 
of an asterisk to π_1 in the distribution strategy for List. In Fig. 2.4 the names 
of the nodes are written inside the nodes and the integer pairs are processor 
ids at which the nodes reside. 

constant n = 2; 
processor MESH (1:n, 1:n); 
type 
    List is distributed 
        struct { int data; *List next; reverse (next,prev) }; 
    Node is distributed 
        struct { *List alist; *Node down }; 
var *Node ListOfLists; 

name List { first: 1; Ψ.next: Ψ + 1; Ψ.prev: Ψ - 1 } 
name Node { first: 1; Ψ.down: Ψ + 1 } 
distribute List on MESH { Π = ( π_1 = *, π_2 = (⌈Ψ/2⌉) mod n + 1 ) } 
distribute Node on MESH { Π = ( π_1 = (⌈Ψ/2⌉) mod n + 1, π_2 = 1 ) } 



Fig. 2.4. A list of link lists. 



The distributions of the tree data structure and the mesh data structure 
presented in this section are analogous to the cyclic distribution schemes used 
for static arrays. In contrast the distribution specified for a list of link lists 
is similar to a block-cyclic distribution used for distributing static arrays. 



3. Compiler Support for Regular Data Structures 

In this section first a representation of pointers that is suitable for distributed 
memory machines is developed. This representation is based upon names that 
are global to all processors rather than simple memory addresses that are lo- 
cal to individual processors. The translation rules for operations involving 
dynamic data structures to enable owner-computes rule based SPMD execu- 
tion are described in detail. 

3.1 Representing Pointers and Data Structures 

On a shared memory machine a pointer variable is simply the memory address 
of the data item that it points to. To accommodate distributed data structures 
the notion of pointers must be generalized so that they can be used for the 
distribution of a data structure across the processors in the system. A pointer 
variable P which points to a node in a distributed data structure is represented 
as ds<Ψ : [Π, α]>, whose components are interpreted as follows: 

P.ds: uniquely identifies the data structure; 
P.Ψ: is the name of the node in the data structure ds to which P points; 
P.Π: is the processor at which the node resides; and 
P.α: is the address in processor P.Π's memory at which the node resides. 

The pointer ds<0 : [Π_0, 0]> represents a nil pointer and P.α.data 
denotes the address of the data field data of the node P points to. When a 
new node is added to a distributed data structure its name Ψ is computed 
from the names of its neighbors using the user specified name generation 
strategy. From the name of a node the processor Π at which it resides is 
computed. 

Each processor represents a distributed data structure as a hash table 
which is identified by ds. The hash table is initially empty and a nil 
assignment to the handle of a distributed data structure clears the hash table. 
The processor Π is computed from Ψ only when the node is initially created. 
Each processor stores the information that the node Ψ exists and that it 
resides at processor Π in the hash table. In addition, the processor Π also 
stores the address of the node α. The node name Ψ is used to access and store 
the information. The runtime mechanism for identifying the processor that will 
execute a given statement uses the information stored in the hash table when 
the statement computes a value in a node of the dynamic data structure. The 
same information is also used for generating interprocessor communication 
required for the execution of the statement under owner-computes rule based 
SPMD execution. 

The hash table is manipulated using the following functions: PutNode(ds,Ψ) 
stores information, GetNode(ds,Ψ) retrieves information, DeleteNode(ds,Ψ) 
deletes nodes in the hash table, and ClearTable(ds) reinitializes the hash 
table. It should be noted that although the hash table is replicated at each 
processor the nodes in the data structure are not replicated. The hash table 
representation enables us to avoid interprocessor communication during 
runtime resolution at the expense of storage at each node. 
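
As an illustration of the replicated table, the following Python sketch models one table per processor. The class names NodeTable and NodeInfo and the snake_case method names are hypothetical stand-ins for PutNode, GetNode, DeleteNode and ClearTable; the stored address is a plain integer chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Optional


@dataclass(frozen=True)
class NodeInfo:
    processor: Hashable        # owner of the node (the processor id)
    address: Optional[int]     # local address, meaningful only on the owner


class NodeTable:
    """One processor's copy of the table for a single data structure ds."""

    def __init__(self) -> None:
        self._entries: Dict[Hashable, NodeInfo] = {}

    def put_node(self, psi: Hashable, processor: Hashable,
                 address: Optional[int] = None) -> None:
        self._entries[psi] = NodeInfo(processor, address)

    def get_node(self, psi: Hashable) -> Optional[NodeInfo]:
        return self._entries.get(psi)   # None means the node does not exist

    def delete_node(self, psi: Hashable) -> None:
        self._entries.pop(psi, None)

    def clear_table(self) -> None:
        self._entries.clear()


# Two processors record that node 5 exists and lives on processor 2;
# only processor 2 keeps a local address for it.
tables = {pid: NodeTable() for pid in (1, 2)}
for pid, table in tables.items():
    table.put_node(5, processor=2, address=0x1000 if pid == 2 else None)
print(tables[1].get_node(5), tables[2].get_node(5))
```

Every processor can answer "who owns node 5?" from its own copy, which is how the lookup avoids communication; the cost is one table entry per node on every processor.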

Temporary pointer variables and pointer variables of a local data structure 
type are also represented as ds<Ψ : [Π, α]>. In case of local data structures 
ds is not a hash table but rather the address of the first node in the data 
structure and the field Ψ is always zero. The field Π is Π_0 if the structure is 
replicated at all processors; otherwise it is the id of the processor at which 
the local structure resides. Finally, α is the local address of the node in the 
processor's local memory. In the case of temporaries initially fields ds, Ψ, and 
Π are zero and if the temporary points to a single node created through it, 
then α is the address of that node. However, if the temporary points to a 
local or distributed data structure its fields take the form associated with 
local and distributed pointer types. The Ψ and Π values of a temporary can 
be examined to discern whether it currently points to a single node which 
does not belong to any data structure, a local data structure, or a distributed 
data structure. This distinction is important since it determines the semantics 
of the operations performed using a temporary pointer variable. 



3.2 Translation of Pointer Operations 

The translation of all types of operations that are performed on pointer vari- 
ables is discussed in this section. The semantics of each operation depends 
upon the categories of the pointer variables involved in the operation. Ini- 
tially all pointers are nil. When a new node is created using a pointer variable, 
the node’s representation is computed and assigned to the pointer variable. 
As mentioned earlier, the nodes in a distributed data structure are referred 
to by their names and a distributed data structure is traversed through the 
generation of these names. The order in which the node names are gener- 
ated corresponds to the order in which the nodes can be accessed using the 
links in the data structure. The compiler uses the following communication 
primitives during code generation: 

— the operation SEND(pid,addr) sends the value stored at the address addr 
on the executing processor to processor pid; 

— the operation global send GSEND(addr) sends the value stored at address 
addr on the executing processor to all other processors; and 

— RECV(addr,pid) receives a value from pid and stores it at the address addr 
on the executing processor. 

In this section f, f^{-1}, and g are used to denote, respectively, a name 
generation function of a regular pointer, a name generation function of a 
reverse pointer and a distribution function. 

Translation of Node Creation Operations 

Nodes can be created using the new operation on pointer variables. First the 
semantics and translation of the new operation for handles and temporaries 
is discussed and then the constructors are considered. 

new(P) 

P is a distributed handle: 



If a node is created using a handle to a distributed data structure it is assumed 
to be the first node in the data structure whose name can be found from the 
name generation scheme provided by the user. From the name of the node the 
processor at which the node will reside is determined from the distribution 
strategy provided by the user. In case the data structure is a substructure of 
a hierarchical data structure and the containing structure at the preceding 
level in the hierarchy is also a distributed structure, then a processor offset 
is computed based upon the containing data structure. This offset is used 
in conjunction with the user specified mapping strategy to determine the 
processor at which the node is to reside. 

In order to implement the above strategy, with each data structure an 
attribute ds.Π_inherit is associated which is set at runtime when the enclosing 
node in a hierarchical data structure is created. If there is no enclosing node, 
then it is set to Π_0. The processor offset associated with a data structure ds, 
denoted as ds.Π_offset, is computed from ds.Π_inherit and the user specified 
distribution strategy. For data structures that do not require any offset (e.g., 
simple data structures that are not contained in any hierarchical structure), 
ds.Π_offset is simply Π_0. Since a handle is replicated at each processor all 
processors carry out the assignment. However, only the processor at which 
the node is to reside allocates storage for the node. The hash table is updated 
to indicate the state of the data structure. 

    ClearTable(P.ds); P.Ψ = first;
    if handle P is contained in a distributed node
        { P.ds.Πoffset = Πinherit ⊕ g(P.Ψ) }
    else { P.ds.Πoffset = Π0 }
    P.Π = g(P.Ψ) + P.ds.Πoffset;
    if (mypid == P.Π) { P.α = malloc(SizeOfNode) } else { P.α = nil }
    PutNode(ds, P.Ψ) = (exists=true, processor=P.Π, address=P.α);
    Initialize Πinherit for all enclosed data structures;

P is a local handle: 

If a node is created using a handle to a local data structure, the value of Ψ
is zero. If the local data structure is not a substructure of a distributed data
structure, then all processors allocate storage, since a local data structure is
replicated at each processor, and each processor assigns its own id to Π. On the
other hand, if the local structure is a substructure of a distributed structure,
then only the processor at which the containing node resides allocates storage
for the structure.

    P.Ψ = 0; P.Π = mypid;
    if ((P.Π == mypid) or (P.Π == Π0))
        { P.α = malloc(SizeOfNode) }



P is a temporary: 

Finally, in the case of a temporary, all processors allocate storage for a node
in their respective local memories. The translation rules described above are
summarized in the code below. In this code mypid represents the id of the
processor executing the code. A temporary is replicated at each processor and
is assigned 0<0 : [Π0, α]>, where α is the address of the node.
The storage is allocated by each processor in its local memory.

    P.ds = 0; P.Ψ = 0; P.Π = Π0; P.α = malloc(SizeOfNode);

new(P->ptr_i)

P points to a node in a local data structure: 

If the nodes are created using constructors belonging to local data structures,
a new node is created and added to the data structure as shown below. In
addition, if the link involved in the operation is a bidirectional link, then code
is generated to create a bidirectional link. The value of P.Π determines whether
the local data structure is a replicated or an unreplicated data structure.

    if ((P.Π == mypid) or (P.Π == Π0)) { addr = malloc(SizeOfNode) }
    else { addr = nil }
    store pointer 0<0 : [P.Π, addr]> at P.α.ptr_i;
    if ptr_i is bidirectional { store 0<0 : [P.Π, P.α]> at addr.revptr_i }

P points to a node in a distributed data structure: 

The creation of a node using a constructor belonging to a distributed data
structure requires the computation of the name of the new node and the
assignment of this node to a processor. CreatePtr_i(ds, Ψ) computes and returns
the pointer representation of the newly created node. It makes use of the name
generation and distribution schemes for assigning the node to a processor.
Recall that a node cannot be created using a reverse pointer P->revptr.

    P->ptr_i = CreatePtr_i(P.ds, P.Ψ)

    function CreatePtr_i(ds, Ψ) {
        Ψ = f(Ψ); Π = g(Ψ) + ds.Πoffset;
        if (mypid == Π) { α = malloc(SizeOfNode) } else { α = 0 }
        PutNode(ds, Ψ) = (exists = true, processor = Π, address = α)
        return( ds<Ψ : [Π, α]> )
    }
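The creation rule above can be mimicked in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code the compiler generates: the dictionary `table` stands in for the hash table, `NodeInfo` and `make_create_ptr` are hypothetical names, a Python list stands in for allocated storage, and the list instance assumes the next-name rule psi + 1 with names dealt round-robin over four processors.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    exists: bool
    processor: int
    address: object  # local storage on the owner, None on every other processor


def make_create_ptr(f, g, size_of_node=1):
    """Build a CreatePtr-like function from a name generator f and a distribution g."""
    def create_ptr(table, psi, mypid, offset=0):
        new_psi = f(psi)                      # name of the node being created
        proc = g(new_psi) + offset            # processor that will own the node
        # Only the owning processor allocates storage for the node.
        addr = [0] * size_of_node if mypid == proc else None
        # Every processor records the node in its copy of the table.
        table[new_psi] = NodeInfo(True, proc, addr)
        return (new_psi, proc, addr)
    return create_ptr


# Hypothetical linked-list instance: next name is psi + 1, four processors.
create_next = make_create_ptr(lambda psi: psi + 1,
                              lambda psi: (psi - 1) % 4 + 1)
```

Calling `create_next` with the same arguments on every processor yields the same name and owner everywhere; only the address field differs between the owner and the other processors.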

Translation of Traversal Operations 

References must be made to the pointer fields of the nodes in the data
structure in order to traverse the data structure. In the case of distributed
data structures, these references are transformed so that the modified code
performs the traversal through node names using the TraversePtr_i(ds, Ψ) and
TraverseRevptr_i(ds, Ψ) functions. On the other hand, the traversal of local
data structures does not require the computation of names.

P->ptr_i (P->revptr_i)

P points to a node in a local data structure
    if ((P.Π == mypid) or (P.Π == Π0))
        { P.α.ptr_i (P.α.revptr_i) is the address at which the node is stored }

P points to a node in a distributed data structure
    TraversePtr_i(P.ds, P.Ψ) (TraverseRevptr_i(P.ds, P.Ψ))
    returns the representation of P->ptr_i (P->revptr_i)

    function TraversePtr_i(ds, Ψ) {
        Ψ = f(Ψ)
        (exists, Π, α) = GetNode(ds, Ψ)
        if exists { return( ds<Ψ : [Π, α]> ) }
        else { return( ds<0 : [Π0, 0]> ) }
    }

    function TraverseRevptr_i(ds, Ψ) {
        Ψ = f⁻¹(Ψ)
        (exists, Π, α) = GetNode(ds, Ψ)
        if exists { return( ds<Ψ : [Π, α]> ) }
        else { return( ds<0 : [Π0, 0]> ) }
    }
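As an illustration of traversal through names, the sketch below follows forward and reverse links in a binary tree without dereferencing any stored pointer. It rests on assumptions that are not taken from the text: the naming scheme (children of name psi are 2·psi and 2·psi + 1, root is 1), the tuple encoding of pointers, and the plain dictionary used in place of the hash table are all hypothetical.

```python
NIL = (0, 0, None)  # (name, processor, address) of a nil pointer; encoding assumed


def traverse(table, psi, name_fn):
    """Follow a link by computing the neighbour's name, then consulting the table."""
    psi = name_fn(psi)
    entry = table.get(psi)        # plays the role of GetNode(ds, psi)
    if entry is None:
        return NIL                # no node with that name exists
    proc, addr = entry
    return (psi, proc, addr)


# Hypothetical name generation functions for a binary tree rooted at name 1.
left = lambda psi: 2 * psi
right = lambda psi: 2 * psi + 1
parent = lambda psi: psi // 2     # reverse pointer: inverse of both left and right

# name -> (owning processor, address on that processor)
table = {1: (1, "a1"), 2: (2, "a2"), 3: (3, "a3"), 5: (1, "a5")}
```

Because the neighbour's name is computed locally, every processor can evaluate `traverse` without communication, which is the property the translation relies on.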

Translation of Pointer Assignments 

The building and modification of a dynamic data structure is carried out using
pointer assignments. First the effects of a nil assignment are considered and
then assignment of one pointer variable to another is considered.

P = nil 

A nil assignment to a handle (local or distributed) essentially eliminates the 
nodes in the entire data structure. On the other hand, a nil assignment to 
a temporary, or a constructor accessed through a temporary, that points 
to a node in a data structure eliminates the node. A nil assignment to a
temporary variable which does not point to any other data structure simply
sets the pointer to nil and does not have an effect on any other data structure.
In summary a major function of nil assignments is to delete one or all nodes 
from a local or distributed data structure. 

P is a temporary that does not point to a data structure
    P = 0<0 : [Π0, 0]>

P is a distributed constructor
    if (P.Π == mypid)
        { delete node corresponding to P.Ψ from hash table for P.ds }
    P = ds<0 : [Π0, 0]>

P is a local constructor
    if ((P.Π == mypid) or (P.Π == Π0))
        { delete node associated with P.Ψ }
    P = ds<0 : [Π0, 0]>;

P is a distributed handle
    ClearTable(P.ds); P = ds<0 : [Π0, 0]>

P is a local handle
    if ((P.Π == mypid) or (P.Π == Π0))
        { delete entire local structure }
    P = ds<0 : [Π0, 0]>;

P = Q 

An assignment of one pointer variable to another pointer variable can have 
the following effects depending on the types of the pointer variables involved. 
A pointer assignment can result in simple copying of a pointer, addition of 
a node to a data structure, splitting of a distributed data structure into two 
distinct data structures, and merging of two data structures into one. These
actions are described in greater detail below.

Copying a pointer: The copying of pointers occurs when dealing with local 
data structures or when the assignment is made to a temporary variable. If P 
is a temporary pointer variable, which currently does not point to any data 
structure, then after it is assigned Q, both P and Q point to the same place. If 
both P and Q are pointers associated with local data structures (i.e., local
handles or constructors), then after the assignment P also points to the same
address as Q. Assignments to handles of local data structures may result in 
addition and deletion of nodes from the data structure that it represents. 

    P.ds = Q.ds; P.Ψ = Q.Ψ; P.Π = Q.Π; P.α = Q.α

Adding a single node: Consider the situation in which Q is a temporary 
pointer variable through which a single node has been created. This node can 
be added to a local or distributed data structure by assigning Q to P, where 
P is a handle or constructor of a local or distributed data structure. After the 
assignment, Q behaves as a pointer to the data structure to which the node 
has been added. Notice that since the node associated with the temporary 
pointer Q is replicated at each site, no communication is required during this 
operation. 

    if ((P.Π == mypid) or (P.Π == Π0))
        { P.α = Q.α }
    Q.Ψ = P.Ψ; Q.Π = P.Π;
    PutNode(P.ds, P.Ψ)

Restructuring a distributed data structure: If both P and Q point to the 
same distributed data structure and P is not a constructor, then assignment 
of Q to P simply results in pointer copying. However if P is a constructor, 
then the data structure is restructured. The node pointed to by Q now takes 
the position that P points to, and all nodes connected directly or indirectly 
through constructors originating at Q are also shifted to new positions. After 
restructuring, Q also points to the position to which the constructor P points.
The restructuring is performed by function Reposition which also generates 
interprocessor communication to transfer nodes to their new sites. 

    Reposition(Q.ds, P.ds, Q.Ψ, P.Ψ);
    Q.Π = P.Π; Q.α = P.α;

    function Reposition(oldds, newds, oldΨ, newΨ) {
        nodelist = {(oldΨ, newΨ)} ∪ {(old, new): old is derived from oldΨ}

        foreach pair (old, new) ∈ nodelist st
          (mypid = g(old)) and (mypid ≠ g(new)) {
            (exists, Π, α) = GetNode(oldds, old); DeleteNode(oldds, old);
            SEND(g(new), α);
            PutNode(newds, new) = (exists=true, processor=g(new), address=0)
        }

        foreach pair (old, new) ∈ nodelist st
          (mypid ≠ g(old)) and (mypid = g(new)) {
            α = malloc(SizeOfNode); DeleteNode(oldds, old);
            RECV(α, g(old));
            PutNode(newds, new) = (exists=true, processor=mypid, address=α)
        }

        foreach pair (old, new) ∈ nodelist st
          (mypid = g(old)) and (mypid = g(new)) {
            (exists, Π, α) = GetNode(oldds, old); DeleteNode(oldds, old);
            PutNode(newds, new) = (exists=true, processor=mypid, address=α)
        }

        Update temporaries pointing to repositioned nodes
    }
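The three cases distinguished by Reposition depend only on who owns the old and the new name. The following sketch classifies each (old, new) pair from the point of view of one processor; it performs no communication and only reports the role that processor would play. The function name `reposition_actions`, the action labels, and the round-robin distribution `g` over four processors are assumptions made for illustration.

```python
def reposition_actions(nodelist, g, mypid):
    """Classify each (old, new) name pair by the role mypid plays in moving the node."""
    actions = []
    for old, new in nodelist:
        src, dst = g(old), g(new)
        if src == mypid and dst != mypid:
            actions.append(("send", old, new, dst))      # ship the node to its new owner
        elif src != mypid and dst == mypid:
            actions.append(("recv", old, new, src))      # allocate storage and receive
        elif src == mypid and dst == mypid:
            actions.append(("rename", old, new, mypid))  # node stays, only its name changes
    return actions


# Hypothetical distribution: names dealt round-robin over processors 1..4.
g = lambda psi: (psi - 1) % 4 + 1
```

A processor that owns neither the old nor the new position gets an empty action list for that pair, so it takes no part in the transfer.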



Merging and splitting of distributed data structures: Assignments involving
handles and constructors belonging to different data structures, at least
one of which is distributed, result in the transfer of nodes between the data
structures. If P is the handle of a data structure and it is assigned Q which
is a handle or a constructor of another data structure, then all nodes starting 
from Q’s position are transferred to the data structure corresponding to P. 
Interprocessor communication must be generated to reposition the nodes at 
their new sites. If nodes are transferred from a distributed data structure to 
a replicated local data structure, then the above operation will essentially 
result in broadcast of the data structure to all processors. The restructuring 
is performed by function Reposition. 



Translation of Node Data Accesses 

Accessing data stored in a local data structure is straightforward if it is
replicated at each processor. However, in the case of data in non-replicated
data structures or distributed data structures, first the processor at which the
data resides is located. Next, interprocessor communication may be required
either to store or to retrieve the data. In order to perform these accesses, instructions
LOAD and STORE are defined. The LOAD operation copies the contents of 
data field data of the node pointed to by the pointer P into a scalar t lo- 
cated at processor pid. The STORE operation copies the contents of a scalar 
t residing at processor pid into the data field data of the node pointed to 
by the pointer P. In the LOAD/STORE operations defined below the scalar 
variable t is a temporary that resides at processor pid. 

LOAD t, pid, P, data  ==  t_pid = P->data

    if (P.Ψ == P.α == 0 and P.Π == Π0) { error - nil pointer }
    elseif ((P.Ψ == 0) and (P.Π == Π0)) {
        - P points to a replicated local data structure
        if (mypid == pid) { t = P.α.data }
    } else {
        - P points to a non-replicated local or distributed structure
        if (P.Π == mypid == pid) { t = P.α.data }
        elseif (P.Π == mypid ≠ pid) { SEND(pid, P.α.data) }
        elseif (P.Π ≠ mypid == pid) { RECV(t, P.Π) }
        else the processor is not involved in the execution because
             P.Π ≠ mypid ≠ pid
    }

STORE t, pid, P, data  ==  P->data = t_pid

    if (P.Ψ == P.α == 0 and P.Π == Π0) { error - nil pointer }
    elseif ((P.Ψ == 0) and (P.Π == Π0)) {
        - P points to a replicated local data structure
        if (mypid == pid) { GSEND(t) }
        else { RECV(P.α.data, pid) }
    } else {
        - P points to a non-replicated local or distributed structure
        if (P.Π == mypid == pid) { P.α.data = t }
        elseif (P.Π == mypid ≠ pid) { RECV(P.α.data, pid) }
        elseif (P.Π ≠ mypid == pid) { SEND(P.Π, t) }
        else the processor is not involved in the execution because
             P.Π ≠ mypid ≠ pid
    }
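The case analysis of LOAD and STORE reduces to a comparison of three processor ids: the owner of the node, the executing processor, and the processor holding the scalar. The sketch below returns only the role a processor plays and does no messaging. The function names, the role labels, and the boolean `replicated` flag (standing in for the test on the pointer's name and processor fields) are assumptions of this sketch.

```python
def load_role(owner, mypid, pid, replicated=False):
    """Role of processor mypid in LOAD t, pid, P, data when P's node lives on owner."""
    if replicated:
        return "copy" if mypid == pid else "idle"
    if owner == mypid == pid:
        return "copy"   # data and scalar are both local
    if owner == mypid:
        return "send"   # owner ships the field to pid
    if mypid == pid:
        return "recv"   # pid receives the field from the owner
    return "idle"       # processor is not involved


def store_role(owner, mypid, pid, replicated=False):
    """Role of processor mypid in STORE t, pid, P, data when P's node lives on owner."""
    if replicated:
        return "gsend" if mypid == pid else "recv"
    if owner == mypid == pid:
        return "copy"
    if owner == mypid:
        return "recv"   # owner receives the new value from pid
    if mypid == pid:
        return "send"   # pid ships the scalar to the owner
    return "idle"
```

For any non-replicated access, exactly one processor sends and one receives, or a single processor copies locally; all remaining processors stay idle.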

An Example 

The translation of a small code fragment using the rules presented in this 
section is demonstrated next. The code fragment shown below traverses a
distributed linked list of integers. The integer value in each of the elements of
the linked list is incremented.

    constant n = 4;
    processor ARRAY(n);

    type Node struct { int data; *Node next }
    List is distributed Node;
    name List { first: 1; ψ.next: ψ+1 };
    distribute List ARRAY { π = (ψ - 1) mod n + 1 };
    var *List alist; *Node tlist;

    tlist = alist;
    while (tlist != nil) {
        tlist->data = tlist->data + 1;
        tlist = tlist->next
    }



The loop for the traversal of the linked list is transformed into the loop
shown below, which will be executed in parallel by all processors in ARRAY.
Under SPMD execution each processor will be responsible for incrementing
list elements that reside in its local memory. As we can see, the traversal is
achieved through name generation and hence it does not require interprocessor
communication. When the entire list has been traversed, the function
TraverseNext(ds, ψ) returns nil. The temporary pointer tlist used to traverse
the linked list is replicated on all processors in ARRAY.

    tlist = alist;
    while (tlist != nil) {
        /* tlist->data = tlist->data + 1 */
        LOAD t, tlist.π, tlist, data
        t_{tlist.π} = t_{tlist.π} + 1
        STORE t, tlist.π, tlist, data
        /* tlist = tlist->next */
        tlist = TraverseNext(tlist.ds, tlist.ψ)
    }

    function TraverseNext(ds, ψ) {
        ψ = ψ + 1
        (exists, π, α) = GetNode(ds, ψ)
        if exists { return( ds<ψ : [π, α]> ) }
        else { return( ds<0 : [π0, 0]> ) }
    }

    function CreateNext(ds, ψ) {
        ψ = ψ + 1
        π = ProcId(ψ)
        if (mypid == π)
            { α = malloc(SizeOfNode) }
        else { α = 0 }
        PutNode(ds, ψ) =
            (exists = true, processor = π, address = α)
        return( ds<ψ : [π, α]> )
    }

    function ProcId(ψ)
        { return( (ψ - 1) mod 4 + 1 ) }
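The SPMD behaviour of the translated loop can be simulated sequentially. In the sketch below each call to `spmd_increment` plays the part of one processor: it walks every name in the list but touches only the elements it owns. The dictionaries standing in for the replicated hash table and the per-processor memories, and the helper names `build_list` and `spmd_increment`, are assumptions of this sketch; the distribution rule is the one given in the example.

```python
N = 4  # number of processors in ARRAY


def proc_id(psi):
    """Distribution from the example: (psi - 1) mod n + 1."""
    return (psi - 1) % N + 1


def build_list(values):
    """Return a table (name -> owner) and per-processor memories holding the data."""
    table = {}
    memory = {p: {} for p in range(1, N + 1)}
    for psi, value in enumerate(values, start=1):   # the first node has name 1
        table[psi] = proc_id(psi)
        memory[proc_id(psi)][psi] = value
    return table, memory


def spmd_increment(table, memory, mypid):
    """Loop body as run by processor mypid; returns how many names it visited."""
    psi, visited = 1, 0
    while psi in table:                 # tlist != nil
        visited += 1
        if table[psi] == mypid:         # owner performs the increment locally
            memory[mypid][psi] += 1
        psi += 1                        # TraverseNext: next name is psi + 1
    return visited
```

Running the function once per processor id increments every element exactly once, while each simulated processor still visits all names. That per-processor overhead is what a local-only traversal function would remove.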



4. Supporting Irregular Data Structures 

In this section we present extensions of the approach described for regular 
data structures so that it can also be used to support irregular data struc- 
tures. In the examples considered so far the data structures considered were 
regular in nature and the name generated for each data item in the data 
structure was unique. In case of irregular data structures it is not possible to 
precisely specify the topology of the data structure and hence the name gen- 
eration strategy. One possible approach that can be employed approximates 
an irregular data structure by a regular data structure for the purpose of 
computing the positions and names of the nodes. A consequence of such an 
approximation is that multiple names may be created for nodes with multiple 
parents. Thus from the perspective of our approach, regular data structures 
may be viewed as unaliased data structures and irregular data structures 
may be viewed as aliased data structures. The following language construct 
allows the user to communicate to the compiler as to whether a distributed 
data structure is aliased or unaliased. 

    type NodeType is { aliased | unaliased } distributed BaseType;

As an example, consider a distributed directed acyclic graph (DAG) shown 
in Fig. 4.1. In the mapping process the DAG has been approximated by a 
binary tree and mapped to a binary tree of processors using the strategy 
described previously. This process causes the nodes in the DAG with multi- 
ple parents to have multiple names. A different name is used to access the 
node depending on the parent through which traversal is being performed. 
The shared node can be accessed using node name 5 through node 2 or 
using node name 6 through node 3. When a node is initially created it is 
assigned to the processor indicated by the node name. When additional links 
are created to point to the same node (as by the assignment statement in 
Fig. 4.1), an alias is created with each additional link while the location 
of the node remains the same. For the DAG shown in Fig. 4.1, the shared 
node will be located on processor P5 if it is created by executing the statement
"new(P->leftchild->rightchild)". If it is created using the statement
"new(P->rightchild->leftchild)", then it will be located at processor P6.
From the above example it is clear that an irregular data structure must be 
declared as an aliased data structure. Although in this example we consid- 
ered an acyclic data structure, in general this approach handles cyclic data 
structures as well. 

In order to implement aliasing, the node information stored at each processor
should be augmented to indicate whether the node is an alias of another
node. If the node is an alias, the field origΨ will contain the position
of the node of which the current node is an alias; otherwise origΨ is zero.
A pointer assignment which assigns to a handle/constructor another pointer
that points to a node in the data structure causes the creation of an alias. A
nil assignment causes the destruction of a name. If the destroyed name has
aliases, one of these aliases must be chosen as the new name for the node and
the node information of all aliases is updated. Thus, for each node we must
also maintain the list of its aliases. The field aliases contains this information.
In addition, we must also traverse the graph and change the names of all nodes
that were derived from the destroyed name. This process is quite similar
to the restructuring process described for regular data structures. However,
there is one important difference. Although the node names are modified, the
nodes are not migrated. This is because for an aliased data structure it is not
clear whether the migration will result in a superior distribution of nodes.







new (P->leftchild->rightchild) 
P->rightchild->leftchild := P->leftchild->rightchild 




new (P->rightchild->leftchild) 
P->leftchild->rightchild := P->rightchild->leftchild 

Fig. 4.1. A directed acyclic graph. 



Creation of an alias

P = Q, where P is a distributed handle/constructor:
    PutNode(P.ds, P.Ψ) =
        (exists=true, processor=Q.Π, address=Q.α, origΨ=Q.Ψ, aliases=null)
    add to the list of aliases of origΨ another alias P.Ψ

Destruction of a name

P = nil, where P is a distributed handle/constructor:
    let P.aliases = {a1, a2, ..., an}
    select a1 as the new name for the node
    PutNode(P.ds, a1) = (exists=true, processor=P.Π,
        address=P.α, origΨ=0, aliases={a2, ..., an})
    for a2, ..., an {
        origΨ = a1; aliases = aliases -
        traverse data structure using P.Ψ and
        find all node names that were derived from P.Ψ
        recompute the position field of these nodes from a1
    }

The functions TraversePtr_i(ds, Ψ) and TraverseRevptr_i(ds, Ψ) must be modified
so that when nodes are traversed using their aliases, the aliases are replaced by
the original names. This is important because, if it is not done, the creation
of a node's alias will require the creation of aliases for all nodes reachable
through pointer traversals from the newly aliased node. For a structure with
loops, such as a circular linked list, an infinite number of aliases would be
created. The function CreatePtr_i(ds, Ψ) remains the same because no aliases
are ever created during the creation of new nodes.

    function TraversePtr_i(ds, Ψ) {
        Ψ = f(Ψ)
        (exists, Π, α, origΨ) = GetNode(ds, Ψ)
        if not exists { return( ds<0 : [Π0, 0]> ) }
        elseif (origΨ ≠ 0) { return( ds<origΨ : [Π, α]> ) }
        else { return( ds<Ψ : [Π, α]> ) }
    }

    function TraverseRevptr_i(ds, Ψ) {
        Ψ = f⁻¹(Ψ)
        (exists, Π, α, origΨ) = GetNode(ds, Ψ)
        if not exists { return( ds<0 : [Π0, 0]> ) }
        elseif (origΨ ≠ 0) { return( ds<origΨ : [Π, α]> ) }
        else { return( ds<Ψ : [Π, α]> ) }
    }
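The effect of resolving aliases during traversal can be seen on the shared node of the DAG example, which is reachable as name 5 through node 2 and as name 6 through node 3. The sketch below is a simplified model under stated assumptions: the table is a plain dictionary, the field names `orig` and `aliases` mirror the description in the text, and the helper names and the child-naming rules (2·psi and 2·psi + 1) are hypothetical.

```python
def put_node(table, psi, proc, addr):
    """Record a freshly created node; it is not an alias of anything."""
    table[psi] = {"proc": proc, "addr": addr, "orig": 0, "aliases": []}


def put_alias(table, alias, orig):
    """Record alias as a second name for the node already known as orig."""
    node = table[orig]
    # The alias entry points at the same processor and address; the node does not move.
    table[alias] = {"proc": node["proc"], "addr": node["addr"],
                    "orig": orig, "aliases": []}
    node["aliases"].append(alias)


def traverse(table, psi, name_fn):
    """Traverse a link; if the target name is an alias, return the original name."""
    psi = name_fn(psi)
    node = table.get(psi)
    if node is None:
        return (0, 0, None)                       # nil pointer
    name = node["orig"] if node["orig"] != 0 else psi
    return (name, node["proc"], node["addr"])
```

Since both paths yield the original name 5, further traversal from the shared node continues under a single name, and no aliases need to be created for its descendants.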



It should be noted that the implementation of an aliased data structure is 
more general than an unaliased data structure. Therefore an unaliased data 
structure can also be correctly handled by the implementation for an aliased 
data structure. In such a situation no aliases will be created and hence the 
list of aliases will always be empty for all nodes. 

5. Compile-Time Optimizations

The focus of this chapter has been on translation or code generation rules
for SPMD execution. In reality, to obtain good performance, optimizations must
also be performed. In the context of programs based upon shared data arrays,
communication optimizations have been found to be useful [2]. Similar
optimizations can also be applied to programs with dynamic data structures.
Some additional optimizations should be performed in the context of dynamic
data structures.

Faster traversal functions: Dynamic data structures are accessed through 
link traversals which can be slow in contrast to indexed random accesses to 
array elements. Thus, techniques to allow faster traversal of dynamic data 
structures would be of potential benefit. In the linked list example of the 
preceding section, each processor iterates through all of the node positions in 
the linked list, although each processor only operates on those elements of the 
list that are local to itself. Through static analysis the compiler can determine 
that the iterations of the loop can be executed in parallel. Thus, the overhead 
of traversal can be reduced by using a function which only iterates through 
the nodes that reside on the same processor. 
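For the linked list example, a local-only traversal is easy to write down because the distribution is cyclic. The sketch below assumes, beyond what the text states, that the list length is known to each processor and that names run from 1 to that length; `local_names` is a hypothetical helper.

```python
def local_names(mypid, length, n):
    """Names in 1..length owned by mypid under the rule (psi - 1) mod n + 1."""
    # Under a cyclic distribution the local names are mypid, mypid + n, mypid + 2n, ...
    return range(mypid, length + 1, n)
```

Each processor then performs roughly length / n iterations instead of length, and the processors together still cover every name exactly once.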

Parallel creation of dynamic data structures: While it is clear that par- 
allelism is exploited during SPMD execution of code that operates on a 
distributed data structure, the same is not true during the creation of a 
distributed dynamic data structure. This is because each processor must up- 
date its hash table when a node is added to the data structure. Through 
compile-time analysis the situations in which a data structure can be created 
in parallel by a distinct creation phase can be identified. The updating of the 
hash table can be delayed to the end of the creation phase. Each processor 
can broadcast the names of the nodes that it created at the end of the cre- 
ation phase to update the hash tables of other processors. In this approach 
parallelism is exploited during the creation of the data structure and the 
updating of the hash tables is also carried out in parallel. 
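The deferred-update scheme can be sketched as two phases. In this simplified model, which is an assumption-laden illustration rather than the described implementation, `create_phase` is what one processor does on its own, and `exchange` stands in for the end-of-phase broadcast by merging every processor's log into one table; both function names are hypothetical.

```python
def create_phase(mypid, names, g):
    """Each processor creates only the nodes it owns and logs their names locally."""
    return [psi for psi in names if g(psi) == mypid]


def exchange(created_by_proc):
    """Stand-in for the broadcast: merge all logs into the table every processor keeps."""
    table = {}
    for proc, names in created_by_proc.items():
        for psi in names:
            table[psi] = proc
    return table


# Hypothetical distribution: names dealt round-robin over processors 1..4.
g = lambda psi: (psi - 1) % 4 + 1
```

No table entry is written during the creation phase itself, so the processors do not need to synchronise until the single exchange at the end.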

Storage optimization: Another source of significant overhead in programs 
with dynamic data structures arises due to dynamic allocation and dealloca- 
tion of storage. During the restructuring of a data structure we can reduce 
the storage allocation overhead by reusing already allocated storage. For ex- 
ample, if elements of a distributed data structure are redistributed among 
the processors, storage freed by elements transferred from a processor can be 
reused by elements transferred to the processor. 
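One simple way to reuse storage during redistribution is a per-processor free list. The sketch below is an assumed design, not one described in the text: `NodePool` is a hypothetical class, a one-element Python list stands in for a node-sized block, and the `allocations` counter exists only to make the saving visible.

```python
class NodePool:
    """Per-processor pool that hands back freed node storage before allocating anew."""

    def __init__(self):
        self.free = []          # blocks released by nodes that moved away
        self.allocations = 0    # number of fresh allocations actually performed

    def alloc(self):
        if self.free:
            return self.free.pop()   # reuse storage of a departed node
        self.allocations += 1
        return [None]                # fresh block

    def release(self, block):
        self.free.append(block)
```

When a processor sends one node away and receives another, the incoming node can occupy the block just released, so the exchange costs no new allocation.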



6. Related Work 

This chapter addressed the problem of supporting dynamic data structures 
on distributed-memory machines using owner-computes rule based SPMD 
execution. The implementation of globally shared dynamic data structures is 
made possible by assignment of global names to the elements in a data struc-
ture. The compiler is responsible for translating a sequential program into 
parallel code that executes in SPMD mode. In spirit this approach is similar 
to the approach originally proposed by Callahan and Kennedy for programs 
with shared data arrays [1]. The techniques described in this chapter differ 
from the preliminary version of this work [3] in the following way. In ear- 
lier work aggregate operations, such as merging and splitting, of distributed 
data structures were not allowed and hierarchical data structures were not 
considered. 

An alternative strategy for supporting dynamic data structures has been 
proposed in [7]. While the approach presented in this chapter relies on the 
transfer of distributed data among the processors to enable the execution 
of program statements, the approach proposed by Rogers et al. relies on
migration of computation across the processors to enable the execution of
statements that require operands distributed among various processors. 

To obtain good performance, programs with dynamic data structures 
must be transformed to expose greater degrees of parallelism. Furthermore 
techniques for identifying data dependences among program statements are 
required to perform these transformations. Language extensions can also be 
developed to facilitate the analysis. In [4,5] Hendren et al. address the above 
issues. 

Summary. The goal of the Olden project is to build a system that provides par-
allelism for general-purpose C programs with minimal programmer annotations. 
We focus on programs using dynamic structures such as trees, lists, and DAGs. 
We describe a programming and execution model for supporting programs that use 
pointer-based dynamic data structures. The major differences between our model 
and the standard sequential model are that the programmer explicitly chooses a 
particular strategy to map the dynamic data structures over a distributed heap, 
and annotates work that can be done in parallel using futures. Remote data access 
is handled automatically using a combination of software caching and computa- 
tion migration. We provide a compile-time heuristic that selects between them for 
each pointer dereference based on programmer hints regarding the data layout. 
The Olden profiler allows the programmer to verify the data layout hints and to 
determine which operations in the program are expensive. We have implemented 
a prototype of Olden on the Thinking Machines CM-5. We report on experiments 
with eleven benchmarks. 



1. Introduction 

To use parallelism to improve performance, a programmer must find tasks 
that can be done in parallel, manage the creation of threads to perform 
these tasks and their assignment to processors, synchronize the threads, and 
communicate data between them. Handling all of these issues is a complex and 
time consuming task. Complicating matters further is the fact that program 
bugs may be timing-dependent, changing in nature or disappearing in the 
presence of monitoring. Consequently, there is a need for good abstractions 
to assist the programmer in performing parallelization. These abstractions 
must not only be expressive, but also efficient, lest the gain from parallelism 
be outweighed by the additional overhead introduced by the abstractions. 

Olden is a compiler and run-time system that supports parallelism on 
distributed-memory machines for general purpose C programs with minimal 
programmer annotations. Specifically, Olden is intended for programs that 
use dynamic data structures, such as trees, lists and DAGs. Although much 
work has been done on compiling for distributed-memory machines, much 
of this work has concentrated on scientific programs that use arrays as their 
primary data structure and loops as their primary control structure.¹ These
techniques are not suited to programs that use dynamic data structures [47],

¹ For example, see [2], [3], [12], [23], [27], [36], [48], and [52].

S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 709-749, 2001. 

Springer-Verlag Berlin Heidelberg 2001 
because they rely on the fact that arrays, unlike dynamic data structures,
are statically defined and directly addressable. 

Olden’s approach, by necessity, is much more dynamic than approaches 
designed for arrays. Instead of having a single thread running on each proces- 
sor as is common in array-based approaches, we use futures to create parallel 
threads dynamically as processors become idle. The programmer marks a 
procedure call with a future if the procedure can be executed safely in par- 
allel with its parent continuation. To handle remote references, Olden uses 
a combination of computation migration and software caching. Computa- 
tion migration sends the computation to the data, whereas caching brings 
the data to the computation. Computation migration takes advantage of the 
spatial locality of nearby objects in the data structure; while caching takes 
advantage of temporal locality and also allows multiple objects on different 
processors to be accessed more efficiently. 

Selecting whether to use computation migration or software caching for 
each program point would be very tedious, so Olden includes a compile-time 
heuristic that makes these choices automatically. The mechanism choice is 
dependent on the layout of the data; therefore, we provide a language exten- 
sion, called local path lengths, that allows the programmer to give a hint about 
the expected layout of the data. Given the local-path-length information (or 
using default information if none is specified), the compiler will analyze the 
way the program traverses the data structures, and, at each program point, 
select the appropriate mechanism for handling remote references. 

What constitutes a good data layout and the appropriate local-path- 
length values for each data structure are not always obvious. To assist the 
programmer, we provide a profiler, which will compute the appropriate local- 
path-length values from the actual data layout at run time, and also report 
the number of communication events caused by each line of the program. 
Using this feedback, the programmer can focus on the key areas where op- 
timization will improve performance and make changes in a directed rather 
than haphazard manner. 

The rest of this chapter proceeds as follows: in Sect. 2, we describe Olden’s 
programming model. We describe the execution model, which includes our 
mechanisms for migrating computation based on the layout of heap-allocated 
data, for sending remote data to the computation that requires it using soft- 
ware caching, and also for introducing parallelism using futures, in Sect. 3. 
Then, in Sect. 4, we discuss the heuristic used by the compiler to choose 
between computation migration and software caching. In Sect. 5, we report 
results for a suite of eleven benchmarks using our implementation on the 
Thinking Machines CM-5, and in Sect. 6, describe the Olden profiler and 
how to use it to improve performance. Finally, we contrast Olden with other 
projects in Sect. 7 and conclude in Sect. 8. 

This paper summarizes our work on Olden. Several other sources [13, 14, 
47] describe this work in more detail. 

2. Programming Model

Olden’s programming model is designed to facilitate the parallelization of C 
programs that use dynamic data structures on distributed-memory machines. 
The programmer converts a sequential C program into an Olden program 
by providing data-layout information, and marking work that can be done 
in parallel. The Olden compiler then generates an SPMD (single-program, 
multiple-data) program [34] . The necessary communication and thread man- 
agement are handled by Olden’s run-time system. In this section, we describe 
the programmer’s view of Olden. 

Our underlying machine model assumes that each processor has an identi- 
cal copy of the program, as well as a local stack that is used to store procedure 
arguments, local variables, and return addresses. Additionally, each processor 
owns one section of a distributed heap. Addresses in this distributed heap are 
represented as a pair, ⟨p, l⟩, that contains a processor name and a local address,
which can be encoded in a single 32-bit word. All pointers are assumed
to point into the distributed heap. 
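
As an illustration of how such a pair might fit in one word, the following is a minimal sketch in C. The 7/25 bit split and all names (gptr_t, gptr_make, and so on) are assumptions made for this example; the text states only that the pair can be encoded in a single 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: the source only says the pair fits in 32 bits. */
#define PROC_BITS  7u
#define LOCAL_BITS (32u - PROC_BITS)
#define LOCAL_MASK ((UINT32_C(1) << LOCAL_BITS) - 1u)

typedef uint32_t gptr_t;   /* packed <processor, local address> pair */

/* Pack a processor name and a local address into one word. */
static gptr_t gptr_make(uint32_t proc, uint32_t local)
{
    assert(proc < (UINT32_C(1) << PROC_BITS));
    assert(local <= LOCAL_MASK);
    return (proc << LOCAL_BITS) | local;
}

static uint32_t gptr_proc(gptr_t g)  { return g >> LOCAL_BITS; }
static uint32_t gptr_local(gptr_t g) { return g & LOCAL_MASK; }

/* The kind of locality test that can precede a pointer dereference. */
static int gptr_is_local(gptr_t g, uint32_t my_proc)
{
    return gptr_proc(g) == my_proc;
}
```

With this layout, deciding whether a reference is local costs one shift and one comparison.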



2.1 Programming Language 

Olden takes as input a program written in a restricted subset of C, with 
some additional Olden-specific annotations. For simplicity, we assume that 
there are no global variables (these could be put in the distributed heap). We
also require that programs do not take the address of stack-allocated objects, 
which ensures that all pointers point into the heap.^ The major differences 
between our programming model and the standard sequential model are that 
the programmer explicitly chooses a particular strategy to map the dynamic 
data structures over the distributed heap and annotates work that can be 
done in parallel. We provide three extensions to C — ALLOC, local path lengths, 
and futures — to allow the programmer to specify this information. ALLOC 
and local path lengths are used to map the dynamic data structures, and 
provide information to the compiler regarding the mapping. Futures are used 
to mark available parallelism, and are a variant of a construct used in many 
parallel Lisps [26]. Throughout this section, we will examine how a very 
simple function, TreeAdd, would be modified for Olden. TreeAdd recursively 
sums the values stored in each of the nodes of the tree. The program is given 
in Fig. 2.1. 

2.2 Data Layout 

Olden uses data layout information provided by the programmer both at run 
time and during compilation. The actual mapping of data to the processors 

^ We do provide structure return values, which can be used to handle many of 
the cases where & (address-of) is needed. 




712 Martin C. Carlisle and Anne Rogers 



typedef struct tree {
  int val;
  struct tree *left, *right;
} tree;

int TreeAdd (tree *t)
{
  if (t == NULL)
    return 0;
  else {
    return (TreeAdd(t->left) + TreeAdd(t->right) + t->val);
  }
}



Fig. 2.1. TreeAdd function. 



is achieved by including a processor number in each allocation request. Olden 
provides a library routine, ALLOC, that allocates memory on a specified pro- 
cessor, and returns a pointer that encodes both the processor name and the 
local address of the allocated memory. Olden also provides a mechanism to 
allow the programmer to provide a hint about the expected run-time layout 
of the data. 



/* Allocate a tree with level levels on processors lo..lo+num_proc-1
   Assume num_proc is a power of 2 */

tree *TreeAlloc (int level, int lo, int num_proc)
{
  if (level == 0)
    return NULL;
  else {
    tree *new, *right, *left;
    int mid, lo_tmp;

    new = (tree *) ALLOC(lo, sizeof(struct tree));
    new->val = 1;
    new->left = TreeAlloc(level-1, lo+num_proc/2, num_proc/2);
    new->right = TreeAlloc(level-1, lo, num_proc/2);
    return new;
  }
}



Fig. 2.2. Allocation code 



Since the heap is distributed and communication is expensive, to get good 
performance, the programmer must place related pieces of data on the same 
processor. For a binary tree, it is often desirable to place large subtrees to- 
gether on the same processor, as it is expected that subtrees contain related 
data. In Fig. 2.2, we give an example function that allocates a binary tree 
such that the subtrees at a fixed depth are distributed evenly across the processors
(where the number of processors is a power of two). Figure 2.3 shows
the distribution of a balanced binary tree that would be created by a call to 
TreeAlloc on four processors with lo equal to zero. 
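
To see which processor each node lands on, the allocation routine can be simulated on one machine. The sketch below is not Olden code: sim_alloc is a hypothetical stand-in for ALLOC that simply calls malloc, and the extra proc field exists only to record the requested owner.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tree {
    int val;
    int proc;                 /* owner, recorded only for this simulation */
    struct tree *left, *right;
} tree;

/* Stand-in for Olden's ALLOC: uses malloc and ignores the processor. */
static void *sim_alloc(int proc, size_t size)
{
    (void)proc;
    return malloc(size);
}

/* Same recursion as the allocation figure, with the owner recorded. */
static tree *TreeAlloc(int level, int lo, int num_proc)
{
    if (level == 0)
        return NULL;
    tree *n = sim_alloc(lo, sizeof *n);
    assert(n != NULL);
    n->val = 1;
    n->proc = lo;
    n->left  = TreeAlloc(level - 1, lo + num_proc / 2, num_proc / 2);
    n->right = TreeAlloc(level - 1, lo, num_proc / 2);
    return n;
}

/* Count how many nodes each processor owns. */
static void count_nodes(const tree *t, int *per_proc)
{
    if (t == NULL)
        return;
    per_proc[t->proc]++;
    count_nodes(t->left, per_proc);
    count_nodes(t->right, per_proc);
}
```

Once num_proc reaches one, the whole remaining subtree stays on a single processor, so on four processors each subtree rooted at depth two is local to one processor.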

To assist the compiler in managing communication, we allow the pro- 
grammer to provide a quantified hint regarding the layout of a recursive data 
structure. A local path length represents the expected number of adjacent 
nodes in the structure that are on the same processor, measured along a 
path. Each pointer field of a data structure has a local path length, either 
specified by the programmer or a compiler-supplied default. Associating a
local path length, l, with a field, F, of a structure indicates that, on average,
after traversing a pointer along field F that crosses a processor boundary,
there will be a path of l adjacent nodes in the structure that reside on the
same processor. 




Fig. 2.4. A simple linked list 



Determining the local path length is simplest for a linked list. Consider, 
for example, the linked list in Fig. 2.4. In this list, there are two nodes on 
Processor 0, followed by four nodes on Processor 1, and three nodes on Processor 2.
Consequently, the local path length for the pointer field would be
$\frac{2+4+3}{3}$, which is 3.
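
The list computation above amounts to dividing the number of nodes by the number of same-processor runs. A minimal sketch in C follows; the function name and the owner array are assumptions made for this example, and the code mirrors only the worked example here, not the rigorous definition given in Sect. 6.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Average run length of consecutive list nodes on one processor.
 * owner[i] is the processor holding the i-th node of the list.
 */
static double list_local_path_length(const int *owner, size_t n)
{
    assert(n > 0);
    size_t runs = 1;                     /* the first node starts a run */
    for (size_t i = 1; i < n; i++)
        if (owner[i] != owner[i - 1])    /* processor boundary crossed */
            runs++;
    return (double)n / (double)runs;
}
```

For the list with two, four, and three nodes on Processors 0, 1, and 2, there are nine nodes in three runs.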

For structures with more than one pointer field, such as the tree used 
in TreeAdd, determining the appropriate value for the local path length is 
more complicated. Local path lengths, however, are merely a hint, and may 
be approximated or omitted (in which case a default value is used). The 
compiler’s analyses are insensitive to small changes in the local path lengths,
and incorrect values do not affect program correctness. In most cases, it 
suffices to estimate the value using a high-level analysis of the structure. 
Recall the allocation shown in Fig. 2.3. Suppose the height of the tree is 12. 
On average, a path from the root of the tree to a leaf crosses a processor 
boundary once. We can therefore estimate the local path length as 6 for 
both pointer fields. In Fig. 2.5, we show a data structure declaration with 
local-path-length hints. 



typedef struct tree {
  int val;
  struct tree *left {6};
  struct tree *right {6};
} tree;

Fig. 2.5. Sample local-path-length hint 



In Sect. 6, we describe the Olden profiler and also provide a mathemati- 
cally rigorous presentation of local path lengths. The profiler automatically 
computes the local path lengths of a data structure by performing a run- 
time analysis. For nine of the eleven benchmarks we implemented, using 
the compiler-specified default local path lengths yielded the same perfor- 
mance as using the exact values computed by the profiler. In the cases where 
the profiler-computed values differed significantly from the defaults, it was 
straightforward to estimate the correct values. Since the programmer can 
guess or use default local-path-length values, and then verify these with the 
profiler, it has never been necessary to perform a detailed analysis of local 
path lengths. 

2.3 Marking Available Parallelism 

In addition to specifying a data layout, the programmer must also mark 
opportunities for parallelism. In Olden, this is done using futures, a variant 
of the construct found in many parallel Lisps [26]. The programmer may 
mark a procedure call as a futurecall, if it may be evaluated safely in parallel 
with its parent context. The result of a futurecall is a future cell, which serves 
as a synchronization point between the parent and the child. A touch of the 
future cell, which synchronizes the parent and child, must also be inserted by 
the programmer before the return value is used. 

In our TreeAdd example, the recursive calls on the left and right subtrees 
do not interfere; therefore, they may be performed safely in parallel. To spec- 
ify this, we mark the first recursive call as a futurecall. This indicates that the 
continuation of the first recursive call, namely the second recursive call, may 
be done in parallel with the first call. Since the continuation of the second 
recursive call contains no work that can be parallelized, we do not mark it 

int TreeAdd (tree *t)
{
  if (t == NULL)
    return 0;
  else {
    tree *t_left;
    future_cell_int f_left;
    int sum_right;

    t_left = t->left;
    f_left = futurecall (TreeAdd, t_left);
    sum_right = TreeAdd (t->right);
    return (touch(f_left) + sum_right + t->val);
  }
}

Fig. 2.6. TreeAdd function using future. 



as a future. Then, when the return value is needed, we must first explicitly 
touch the future cell. In Fig. 2.6, we show the TreeAdd program as modified 
to use futures. 
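
On a single processor, a futurecall followed by a touch behaves like an ordinary call whose result is read later. The sketch below emulates that sequential behavior in standard C. FUTURECALL and touch_int are hypothetical stand-ins written for this example and are not Olden's implementation, which captures a continuation so that the rest of the caller can run in parallel.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tree {
    int val;
    struct tree *left, *right;
} tree;

/* Hypothetical single-processor stand-ins, not Olden's implementation. */
typedef struct { int value; int ready; } future_cell_int;

/* Evaluate the call eagerly and fill the cell. */
#define FUTURECALL(cell, fn, arg) \
    do { (cell).value = fn(arg); (cell).ready = 1; } while (0)

/* Read the result; a real touch would block until the child finishes. */
static int touch_int(const future_cell_int *cell)
{
    assert(cell->ready);
    return cell->value;
}

static int TreeAdd(tree *t)
{
    if (t == NULL)
        return 0;

    future_cell_int f_left = {0, 0};
    int sum_right;

    FUTURECALL(f_left, TreeAdd, t->left);   /* left subtree */
    sum_right = TreeAdd(t->right);          /* continuation of the futurecall */
    return touch_int(&f_left) + sum_right + t->val;
}
```

The emulation makes the ordering constraint visible: the future cell is touched only after the right-hand call, which is the work a second processor could overlap with the left-hand call.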



3. Execution Model 

Once the programmer has specified the data layout and identified opportuni- 
ties for parallelism, the system must still handle communication, work distri- 
bution, and synchronization. In this section, we describe how Olden handles 
remote references and how Olden extracts parallelism from the computation. 
To handle remote references, Olden uses a combination of computation migra- 
tion, which moves the computation to the data, and software caching, which 
brings a copy of the data to the computation. Parallelism is introduced using 
futures. At the end of this section, we present an example, TreeMultAdd, that 
illustrates the execution model in action. 

Due to space constraints, we do not discuss the implementation of Olden’s 
run-time system in any detail. We refer the interested reader to Rogers et 
al. [47] and Carlisle [14] for details. 

3.1 Handling Remote References 

When a thread of computation attempts to access data from another proces- 
sor, communication must be performed to satisfy the reference. In Olden, we 
provide two mechanisms for accessing remote data: computation migration 
and software caching. These mechanisms may be viewed as duals. Computation
migration sends the thread of computation to the data it needs, whereas







software caching brings a copy of the data to the computation that needs it. 
The appropriate mechanism for each point in the program is selected auto- 
matically at compile time using a mechanism that is described in Sect. 4. 

3.1.1 Computation Migration. The basic idea of computation migration 
is that when a thread executing on Processor P attempts to access a location^ 
residing on Processor Q, the thread is migrated from P to Q. Full thread 
migration entails sending the current program counter, the thread’s stack, 
and the current contents of the registers to Q. Processor Q then sets up 
its stack, loads the registers, and resumes execution of the thread at the 
instruction that caused the migration. Once it migrates the thread, Processor
P is free to do other work, which it gets from a work dispatcher in the run-time
system.

Full thread migration is quite expensive, because the thread’s entire stack 
is included in the message. To make computation migration affordable, we 
send only the portion of the thread’s state that is necessary for the current 
procedure to complete execution: the registers, program counter, and current 
stack frame. Later we will explain computation migration in the context 
of a real program; here, we use the example in Fig. 3.1 to illustrate the 
concepts. During the execution of H, the computation migrates from P to Q. 
Q receives a copy of the stack frame for H, which it places on its stack. Note, 
however, that when it is time to return from the procedure, it is necessary 
to return control to Processor P, because it holds the stack frame of H’s 
caller. To accomplish this, Q places a stack frame for a special return stub 
directly below the frame for H. This frame holds the return address and 
the return frame pointer for the currently executing function. The return 
address stored in the frame of H is modified to point to a stub procedure. The 
stub procedure migrates the thread of computation back to P by sending a 
message that contains the return frame pointer (<b>), and the contents of
the registers. Processor P then completes the procedure return by loading the
return address from its copy of the frame, deallocating its copy of the frame,
and then restarting the thread at the return address. Note that Q does not need
to return the stack frame for H to P, as it will be deallocated immediately.

Olden implements a simple optimization to circumvent a chain of trivial 
returns in the case that a thread migrates several times during the course of 
executing a single function. Upon a migration, the run-time system examines 
the current return address of the function to determine whether it points to 
the return procedure. If so, the original return address, frame pointer, and 
node id are pulled from the stub’s frame and passed as part of the migration 



^ The Olden compiler inserts explicit checks into the code to test a pointer deref- 
erence (recall that pointers encode both a processor name and a local address) 
to determine if the reference is local and to migrate the thread as needed. On 
machines that have appropriate support, the address translation hardware can 
be used to detect non-local references. This approach is preferable when the 
ratio of cost of faults to cost of tests is less than the ratio of tests to faults [4]. 

message. This allows the eventual return message to be sent directly to the
original processor. 



[Figure: stack of Processor P before migration, stacks of Processors P and Q
after migration, and stack of Processor P after return. The frames shown
include G, F, and the current stack frame; <x> is a frame pointer.]

Fig. 3.1. Stack during computation migration (stacks grow up) 



3.1.2 Software Caching. There are occasions where it is preferable to 
move the data rather than the computation. To accomplish this, we use a 
software-caching mechanism that is very similar to the caching scheme in 
Blizzard-S [50]. Each processor uses a portion of its local memory as a large, 
fully associative, write-through cache. A write-through cache is used because 
update messages can be sent cheaply from user level on the CM-5, and this 
allows us to overlap these messages with computation. As in Blizzard-S, we 
perform allocation on the page level, and perform transfers at the line level. ^ 
The main difference between Olden’s cache and that in Blizzard-S is that we 
do not rely on virtual-memory support. We use a 1K hash table with a list

^ In Olden, a page has 2K bytes, and a line has 64 bytes.

of pages kept in each bucket to translate addresses. Since entries are kept on
a per-page basis, the chains in each bucket will tend to be quite short (in our 
experience, the average chain length is approximately one). 

The Olden compiler directly inserts code before each heap reference that 
uses software caching. This code searches the lists stored in the hash table, 
checks the valid bit for the line, and returns a tag used to translate the address 
from a global to a local pointer. In the event that the page is not allocated 
or the line is not valid, the appropriate allocation or transfer is performed by 
a library routine. 
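
The lookup described above can be sketched as follows. The page size, line size, and table size come from the text; everything else (the names, the modulo hash, the use of plain 32-bit integers as global addresses, and the fill routine that marks a line valid without fetching it) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE      2048u                     /* bytes per page (from the text) */
#define LINE_SIZE      64u                       /* bytes per line (from the text) */
#define LINES_PER_PAGE (PAGE_SIZE / LINE_SIZE)   /* 32, so one 32-bit valid mask */
#define NUM_BUCKETS    1024u                     /* 1K-entry hash table */

typedef struct page_entry {
    uint32_t page_id;          /* global address / PAGE_SIZE */
    uint32_t valid;            /* one valid bit per line */
    unsigned char *data;       /* local copy of the page */
    struct page_entry *next;   /* chain within the bucket */
} page_entry;

static page_entry *buckets[NUM_BUCKETS];

/* Hypothetical hash: page number modulo the table size. */
static uint32_t bucket_of(uint32_t page_id) { return page_id % NUM_BUCKETS; }

/* Return the local address for a global one, or NULL on a miss. */
static unsigned char *cache_lookup(uint32_t gaddr)
{
    uint32_t page_id = gaddr / PAGE_SIZE;
    uint32_t offset  = gaddr % PAGE_SIZE;
    uint32_t line    = offset / LINE_SIZE;

    for (page_entry *e = buckets[bucket_of(page_id)]; e != NULL; e = e->next)
        if (e->page_id == page_id)
            return ((e->valid >> line) & 1u) ? e->data + offset : NULL;
    return NULL;   /* page not allocated */
}

/* Miss path: allocate the page if needed and mark one line valid.
 * A real system would fetch the line from its owner at this point. */
static unsigned char *cache_fill(uint32_t gaddr)
{
    uint32_t page_id = gaddr / PAGE_SIZE;
    uint32_t line    = (gaddr % PAGE_SIZE) / LINE_SIZE;
    page_entry *e = buckets[bucket_of(page_id)];

    while (e != NULL && e->page_id != page_id)
        e = e->next;
    if (e == NULL) {
        e = calloc(1, sizeof *e);
        assert(e != NULL);
        e->data = calloc(PAGE_SIZE, 1);
        assert(e->data != NULL);
        e->page_id = page_id;
        e->next = buckets[bucket_of(page_id)];
        buckets[bucket_of(page_id)] = e;
    }
    e->valid |= UINT32_C(1) << line;
    return e->data + gaddr % PAGE_SIZE;
}
```

Because allocation is per page and validity is per line, a hit on an allocated page can still miss on a line that has not yet been transferred.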

Once we introduce a local cache at each processor, we must provide a 
means to ensure that no processor sees stale data. Olden uses a relaxed co- 
herence scheme where each processor invalidates its own cache at synchro- 
nization points. This coherence mechanism can be shown to be equivalent to 
sequential consistency [37] for Olden programs and is discussed in detail in 
Carlisle’s dissertation [14]. 



3.2 Introducing Parallelism 

While computation migration and software caching provide mechanisms for 
operating on distributed data, they do not provide a mechanism for extracting 
parallelism from the computation. When a thread of computation migrates 
from Processor P to Q, P is left idle. In this section, we describe a mecha- 
nism for introducing parallelism. Our approach is to introduce continuation- 
capturing operations at key points in the program. When a thread migrates 
from P to Q, Processor P can start executing one of the captured continua- 
tions. The natural place to capture continuations is at procedure calls, since 
the return linkage is effectively a continuation. This provides a fairly inexpen- 
sive mechanism for labeling work that can be done in parallel. In effect, this 
capturing technique chops the thread of execution into many pieces that can 
be executed out of order. Thus the introduction of continuation-capturing 
operations must be based on an analysis of the program, which can be done 
either by a parallelizing compiler targeted for Olden or by a programmer 
using Olden directly. 

Our continuation-capturing mechanism is essentially a variant of the fu- 
ture mechanism found in many parallel Lisps [26]. In the traditional Lisp 
context, the expression (future e) is an annotation to the system that says 
that e can be evaluated in parallel with its context. The result of this evalu- 
ation is a future cell that serves as a synchronization point between the child 
thread that is evaluating e and the parent thread. If the parent touches the
future cell, that is, attempts to read its value, before the child is finished, 
then the parent blocks. When the child thread finishes evaluating e, it puts 
the result in the cell and restarts any blocked threads. 

Our view of futures, which is influenced by the lazy-task-creation scheme
of Mohr et al. [41], is to save the futurecall’s context (return continuation) on
a work list (in our case, a stack) and to evaluate the future’s body directly.^
If a processor becomes idle (either through a migration or through a blocked
touch), then we grab a continuation from the work list and start executing it;
this is called future stealing. In most parallel Lisp systems, touches are implicit
and may occur anywhere. In Olden, touches are done explicitly using the 
touch operation and there are restrictions on how they may be used. The first 
restriction is that only one touch can be attempted per future. The second 
is that the one allowed touch must be done by the future’s parent thread 
of computation. These restrictions simplify the implementation of futures 
considerably and we have not found any occasions when it would be desirable 
to violate them. 

Due to space constraints, we will not discuss the implementation of fu- 
tures, except to note an important fact about how Olden uses the contin- 
uation work list. When a processor becomes idle, it steals work only from 
its own work list. A processor never removes work from the work list of an- 
other processor. The motivation for this design decision was our expectation 
that most futures captured by a processor would operate on local data. Al- 
though allowing another processor to steal this work may seem desirable for 
load-balancing purposes, it would simply cause unnecessary communication. 
Instead, load balancing is achieved by a careful data layout. This is in con- 
trast to Mohr et al.’s formulation, where an idle processor removes the oldest 
piece of work from another processor’s work queue. 
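
The work-list discipline described above can be sketched as a per-processor stack. This is only an illustration: representing a continuation as a function pointer plus an argument, the fixed capacity, and all names are assumptions, and popping the most recent entry is inferred from the text's description of the work list as a stack.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 64

/* A captured continuation: hypothetical representation as a
 * function pointer plus an argument. */
typedef struct {
    void (*resume)(void *);
    void *arg;
} continuation;

typedef struct {
    continuation items[MAX_WORK];
    size_t top;
} work_list;

/* Called at a futurecall: save the return continuation. */
static void work_push(work_list *w, continuation c)
{
    assert(w->top < MAX_WORK);
    w->items[w->top++] = c;
}

/* Called when this processor goes idle: take work from its own list only.
 * Returns 1 after running a continuation, or 0 if the list was empty. */
static int steal_future(work_list *own)
{
    if (own->top == 0)
        return 0;
    continuation c = own->items[--own->top];
    c.resume(c.arg);
    return 1;
}

/* Example continuation body used to exercise the list. */
static void bump(void *arg) { *(int *)arg += 1; }
```

Note that steal_future takes only the processor's own list as an argument; there is deliberately no way to name another processor's list.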



3.3 A Simple Example 

To make the ideas of this chapter more concrete, we present a simple example. 
TreeMultAdd is a prototypical divide-and-conquer program, which computes, 
for two identical binary trees, the sum of the products of the values stored 
at corresponding nodes in the two trees. 

Figure 3.2 gives an implementation of TreeMultAdd that has been anno- 
tated with futures. We mark the left recursive call to TreeMultAdd with a 
futurecall; the result is not demanded until after the right recursive call. We
assume that the compiler has chosen to use computation migration to satisfy 
remote dereferences of t and software caching to satisfy remote dereferences 
of u. 

To understand what this means in terms of the program’s execution, con- 
sider a call to TreeMultAdd on the two trees whose layout is shown in Fig. 3.3. 
Consider the first call to TreeMultAdd, made on Processor 0 with t0 and u0 as
arguments. Since both t and u are local, no communication is needed to compute
t_left and u_left. Then, once the recursive call on the left subtrees
is made, t, now pointing to t1, is non-local. Since the compiler has selected
computation migration for handling remote references to t, the statement 

^ This is also similar to the workcrews paradigm proposed by Roberts and
Vandevoorde [46].

t_left = t->left will cause the computation to migrate to Processor 1.
After the migration, u, now pointing to u1, is also non-local, and the statement
u_left = u->left will cause a cache miss. As the computation continues
on the subtrees rooted at t1 and u1, all references to t will be local and all
references to u remote. Consequently, no further migrations will occur, and
the subtree rooted at u1 will be cached on Processor 1. Once the computation
on the subtrees rooted at t1 and u1 is complete, a return message will be sent
to Processor 0.

int TreeMultAdd (tree *t, tree *u)
{
  if (t == NULL) {
    assert(u == NULL);
    return 0;
  }
  else {
    tree *t_left, *u_left;
    future_cell f_left;
    int sum_right;

    assert(u != NULL);
    t_left = t->left;   /* may cause a migration */
    u_left = u->left;   /* may cause cache miss */
    f_left = futurecall (TreeMultAdd, t_left, u_left);
    sum_right = TreeMultAdd (t->right, u->right);
    return (touch(f_left) + sum_right + t->val*u->val);
  }
}



Fig. 3.2. TreeMultAdd with Olden annotations. 





Fig. 3.3. Example input for TreeMultAdd 

Meanwhile, after the migration, Processor 0 is idle. Therefore, it will steal
the continuation of the first recursive call. This continuation starts the recursive
call on the right subtrees, rooted at t2 and u2. Note that all dereferences of t
are local, and all dereferences of u remote. The subtree rooted at u2 will be
cached on Processor 0, and no migrations will occur. Once this computation
completes, Processor 0 will attempt to touch the future cell associated with
the call with arguments t1 and u1. At this point, execution must wait for the
return message from Processor 1. An execution trace for both processors is
shown in Fig. 3.4.



Processor 0                          Processor 1

t=t0, u=u0                           idle
t=t1, u=u1                           idle
        ---------- migration ---------->
steal future                         t=t1, u=u1
t=t2, u=u2                           computation rooted at t1, u1
computation rooted at t2, u2
touch future                         subtree completes
        <------- return message --------
resume waiting future
computation completes

Fig. 3.4. Execution trace for TreeMultAdd 



As previously mentioned, since the compiler has chosen to use caching to
satisfy remote dereferences of u, the subtree rooted at u1 will end up being
cached on Processor 1, and the subtree rooted at u2 on Processor 0. Had
the compiler chosen to use migration for dereferences of both t and u, the 
computation would have bounced back and forth between the two processors. 
Here we see the advantage of using both computation migration and software 
caching. By using migration, we obtain locality for all references to the tree t, 
and by using caching, we prevent the computation from migrating back and 
forth repeatedly between the processors. We examine further the benefits of 
using both computation migration and software caching in Sections 4 and 5. 

Neither the reference to t->right nor the reference to t->val can ever be 
the source of a migration. Once a non-local reference to t->left causes the 
computation to migrate to the owner of t, the computation for the currently 
executing function will remain on the owner of t until it has completed (since 
we assumed references to u will use caching rather than migration). 

4. Selecting Between Mechanisms



As mentioned earlier, the Olden compiler decides, for each pointer derefer- 
ence, whether to use caching or migration for accessing remote data. Our 
goal is to minimize the total communication cost over the entire program. 
Consequently, although an individual thread migration is substantially more 
expensive than performing a single remote fetch (by a factor of about seven 
on the CM-5), it may still be desirable to pay the cost of the migration, if 
moving the thread will convert many subsequent references into local refer- 
ences. Consider a list of N elements, evenly divided among P processors (two 
possible configurations are given in Fig. 4.1). First suppose the list items are 
distributed in a block fashion. A traversal of this list will require $N(P-1)/P$
remote accesses if software caching is used, but only $P - 1$ migrations if the
computation is allowed to follow the data (assuming $N \gg P$). Hence, it is
better to use computation migration for such a data layout. Caching, however,
performs better when the list items are distributed using a cyclic layout. In
this case, using computation migration will require N migrations, whereas
caching requires $N(P-1)/P$ remote accesses.
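
The trade-off between the two layouts can be checked with a small simulation. The sketch below is an illustration under stated assumptions: P divides N, the traversal starts on the processor that owns the first item, every remote item costs one access under caching (line-level transfers are ignored), and all function names are hypothetical.

```c
#include <assert.h>

/* Owner of list item i under each layout (n items, p processors,
 * p divides n). */
static int blocked_owner(int i, int n, int p) { return i / (n / p); }
static int cyclic_owner(int i, int n, int p)  { (void)n; return i % p; }

typedef int (*owner_fn)(int i, int n, int p);

/* Migration: the thread moves whenever the next item lives elsewhere. */
static int count_migrations(owner_fn owner, int n, int p)
{
    int here = owner(0, n, p), moves = 0;
    for (int i = 1; i < n; i++) {
        int o = owner(i, n, p);
        if (o != here) { moves++; here = o; }
    }
    return moves;
}

/* Caching: the thread stays on the first item's processor and fetches
 * every item owned by another processor. */
static int count_remote_accesses(owner_fn owner, int n, int p)
{
    int home = owner(0, n, p), remote = 0;
    for (int i = 0; i < n; i++)
        if (owner(i, n, p) != home)
            remote++;
    return remote;
}
```

For 16 items on 4 processors, the blocked layout needs 3 migrations against 12 remote accesses, while the cyclic layout needs 15 migrations (roughly one per item) against the same 12 remote accesses. Since a migration costs several times as much as a remote fetch, the blocked case favors migration and the cyclic case favors caching.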



[Figure: a blocked distribution and a cyclic distribution of a linked list.]




Fig. 4.1. Two different list distributions. The numbers in the boxes are 
processor numbers. Dotted arrows represent a sequence of list items. 

Olden uses a three-step process to select a mechanism for each program 
point. First, the programmer specifies local path lengths, which give hints to 
the compiler regarding the layout of the data. Second, a data flow analysis is 
used to find pointers that traverse the data structure in a regular manner. In 
each loop (either iterative or recursive), at most one such variable is selected 
for computation migration. Finally, interactions between loops are consid- 
ered, and additional variables are marked for caching, if it is determined that 
using computation migration for them may cause a bottleneck. 

4.1 Using Local Path Lengths 

Since the communication cost of using a particular mechanism for a particu- 
lar program fragment is highly dependent on the layout of the data, we allow 
the programmer to provide a quantified hint to the compiler regarding the 
layout of a recursive data structure. As previously discussed in Sect. 2, each 
pointer field of a structure may be marked with a local path length, which
represents the expected number of adjacent nodes in the structure that are
all on the same processor; if the programmer does not specify a local path
length, a default value is supplied. To be more precise, a local path length of l
associated with a field, F, indicates that, on average, after traversing a pointer
along field F that crosses a processor boundary, there will be a path of l adjacent
nodes in the structure that reside on the same processor. Consequently,
the local path length provides information regarding the relative benefit of 
using computation migration or software caching, as it provides information 
about how many future references will be made local by performing a migra- 
tion. In Sect. 6, we discuss how to compute local path lengths in more detail, 
and present the Olden profiler, which computes the local path lengths at run 
time. 

The compiler converts local path lengths into probabilities, called path 
affinities. Recall that a geometric distribution is obtained from an infinite 
sequence of independent coin flips [45]. It takes one parameter, p, which is 
the probability of a success. The value of a geometric random variable is the 
number of trials up to and including the first success. If we consider traversing 
a local pointer to be a failure, then the path affinity is the probability of 
failure. The local path length is then the expected value of this geometric
distribution; therefore, the path affinity is given by $1 - \frac{1}{\text{local path length}}$.
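
A geometric random variable with success probability p has expected value 1/p. Setting that expectation equal to a local path length l gives p = 1/l, so the probability of failure (following a local pointer) is 1 - 1/l. A minimal sketch of the conversion in C, with function names chosen for this example:

```c
#include <assert.h>

/* Path affinity for a local path length len >= 1: 1 - 1/len. */
static double path_affinity(double len)
{
    assert(len >= 1.0);
    return 1.0 - 1.0 / len;
}

/* Inverse mapping: expected local path length for an affinity in [0, 1). */
static double affinity_to_length(double affinity)
{
    assert(affinity >= 0.0 && affinity < 1.0);
    return 1.0 / (1.0 - affinity);
}
```

A local path length of 10 maps to an affinity of about 0.9, a length of 3.33 to about 0.7, and a length of one (every pointer crosses a processor boundary) to zero.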

The intuition behind our heuristic’s use of local path lengths and path 
affinities is illustrated by the examples given in Fig. 4.1. In the blocked case, 
where computation migration is the preferred mechanism, the local path
length is $N/P$ (there are N list items, distributed in a blocked fashion across P
processors). The path affinity of the next field is then $1 - P/N$. In the cyclic
case, where caching performs better, the next field has a local path length
of one (each next pointer is to an object on a different processor), and a 
path affinity of zero. In general, computation migration is preferable when 
the local path lengths and path affinities are large, and software caching is 
preferable when the local path lengths and path affinities are small. 

typedef struct tree {
  int val;
  struct tree *left {10};
  struct tree *right {3.33};
} tree;

Fig. 4.2. Sample local-path-length hint 



In the remainder of this section, we describe how the compiler uses the 
path affinities computed from the programmer-specified local path lengths to 
select between computation migration and software caching. In our examples, 
we will refer to the structure declaration from Fig. 4.2. For this declaration,
the path affinity of the left field is $1 - \frac{1}{10}$, or 90%, and the path affinity of
the right field is $1 - \frac{1}{3.33}$, or 70%.

4.2 Update Matrices 

We want to estimate how the program will, in general, traverse its recursively 
defined structures. To accomplish this, we examine the loops and recursive 
calls (hereafter referred to as control loops) checking how pointers are updated 
in each iteration. We say that s is updated by t along field F in a given loop, 
if the value of s at the end of an iteration is the value of t from the beginning 
of the iteration dereferenced through field F (that is, s = t->F). This notion
extends directly to paths of fields. Intuitively, variables that are updated by 
themselves in a control loop will traverse the data structure in a regular 
fashion. We call such variables induction variables. This is similar to the 
notion of induction variables used in an optimizing compiler’s analysis of for-loops [1].
In both cases, an induction variable is a variable that is updated
in a regular fashion in each iteration of the loop. In Fig. 4.3, s and t are
induction variables (since s = s->left and t = t->right->left), whereas u is
not (since its value cannot be written as a path from its value in the previous
iteration).



struct tree *s, *t, *u;
while (s) {
  s = s->left;
  t = t->right->left;
  u = s->right;
}

Update matrix (entries are path affinities in percent):

              before
             s    t    u
after   s   90
        t        63
        u   70

Fig. 4.3. A simple loop with induction variables 



We summarize information on possible induction variables in an update 
matrix. The entry at location (s,t) of the matrix is the path affinity of the 
update, if s is updated by t, and is blank otherwise. In Fig. 4.3, since s is 
updated by itself along the field left, the entry (s, s) in the update matrix is
90 (the affinity of the left field). Induction variables are then simply those 
pointers with entries along the diagonal (that is, they have been updated 
by themselves). In our example, the variables s and t have entries on the 
diagonal. We will consider only these for possible computation migration, as 
they traverse the structure in a regular manner. 
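
Reading induction variables off the diagonal can be sketched in a few lines of C. The representation below (a small integer matrix with a sentinel for blank entries) and the names are assumptions made for this example, not the compiler's actual data structure.

```c
#include <assert.h>

#define NVARS 3
#define BLANK (-1)   /* no update recorded */

/* Rows: variable after the iteration; columns: variable before.
 * Entries are path affinities in percent. */
typedef int update_matrix[NVARS][NVARS];

/* A variable is an induction variable if it is updated by itself,
 * that is, if its diagonal entry is not blank. */
static int is_induction_variable(update_matrix m, int v)
{
    return m[v][v] != BLANK;
}

/* Count the candidates for computation migration in one control loop. */
static int count_induction_variables(update_matrix m)
{
    int count = 0;
    for (int v = 0; v < NVARS; v++)
        if (is_induction_variable(m, v))
            count++;
    return count;
}
```

With the variables ordered s, t, u and a matrix whose diagonal holds 90 for s, 63 for t, and a blank for u, only s and t qualify, whatever off-diagonal entry u may have.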

The update matrices may be computed using standard data-flow meth- 
ods. (Note again that exact or conservative information is not needed, as 
errors in the update matrices will not affect program correctness.) The only 

struct tree *TreeSearch(struct tree *t, int key)
{
  while (t && t->key != key) {
    if (t->key > key)
      t = t->left;
    else
      t = t->right;
  }
  return t;
}

Update matrix (entries are path affinities in percent):

            before
              t
after   t    80

Fig. 4.4. An example of updates with if-then 



complications are that variables may have multiple updates or update paths 
of length greater than one. There are three cases. The first is a join point 
in the flow graph (for example, at the end of an if-then statement). Here 
we simply merge the two updates from each branch by taking the average 
of their affinities. This corresponds to assuming each branch is taken about 
half of the time, and could be improved with better branch prediction in- 
formation. If the update does not appear in both branches and we have no 
branch prediction information, rather than averaging the update, we omit 
it. We do this because we wish to consider only those updates that occur in 
every iteration of the loop, thus guaranteeing that the updated variable is 
actually traversing the structure. Having a matrix entry for a variable that 
is not updated might cause the compiler to surmise incorrectly that it could 
make a large number of otherwise remote references local by using migration 
for this variable. An example of the branch rule is given in Fig. 4.4. The 
induction variable, t, has an update with path affinity 80, the average of the 
updates in the two branches (t->left 90%, t->right 70%). 
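
The join rule can be written as a small C function. This is a sketch under stated assumptions: affinities are integer percentages, a missing update is encoded with the hypothetical constant `NO_UPDATE`, and no branch-prediction information is available, so both branches are weighted equally.

```c
#include <assert.h>

#define NO_UPDATE (-1)   /* hypothetical marker: no update in this branch */

/* Merge the updates of one variable at a join point in the flow graph.
 * If either branch lacks the update, the merged update is omitted;
 * otherwise the two affinities (in percent) are averaged. */
int merge_at_join(int then_affinity, int else_affinity)
{
    assert(then_affinity <= 100 && else_affinity <= 100);
    if (then_affinity == NO_UPDATE || else_affinity == NO_UPDATE)
        return NO_UPDATE;
    return (then_affinity + else_affinity) / 2;
}
```

For the two branches of TreeSearch, with affinities 90 and 70, the function returns 80, the entry shown in Fig. 4.4.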



    int TreeAdd(struct tree *t)
    {
        if (t == NULL) return 0;
        else
            return TreeAdd(t->left)
                 + TreeAdd(t->right) + t->val;
    }

    Update matrix (rows: after; columns: before)

              t
        t    97

Fig. 4.5. TreeAdd 




Second, we must have a rule for multiple updates via recursion. Consider 
the simple recursive program in Fig. 4.5. Note that t has two updates, one 
corresponding to each recursive call. The two recursive calls form a control 
loop. In this case, we define the path affinity of the update as the probability 
that either of the updates will be along a local path (since both are going 
to be executed). Because the path affinity of left is 90% and right is 70%, 
the probability that both are remote is 3% (assuming independence). Con- 
sequently, the path affinity of the update of t because of the recursive calls 
is 97%, the probability that at least one will be local. Rather than combin- 
ing the updates from the two branches of the if-then-else statement, we 
instead notice that the recursive calls occur within the else branch. This 
means we can predict that the else branch is almost always taken, and con- 
sequently, only the rule for recursion is used to compute the update of t in 
this control loop. 

The final possibility is an update path of length greater than one (for 
example, t = t->right->left). The path affinity of this case is simply the 
product of the path affinities of each field along the path. An example of this 
is given in Fig. 4.3. Here the path affinity of the update of t is the product 
of the path affinities of the right and left fields (90% * 70% = 63%). 
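
The recursion rule and the multi-field path rule are both short probability computations, sketched below in C. The function names are hypothetical, affinities are expressed as probabilities in [0, 1], and independence between fields is assumed, as in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Recursion rule: probability that at least one of n independent
 * updates follows a local path. */
double recursion_affinity(const double *affinity, size_t n)
{
    double all_remote = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(affinity[i] >= 0.0 && affinity[i] <= 1.0);
        all_remote *= 1.0 - affinity[i];
    }
    return 1.0 - all_remote;
}

/* Path rule: probability that every field along an update path of
 * length n is local. */
double path_affinity(const double *affinity, size_t n)
{
    double all_local = 1.0;
    for (size_t i = 0; i < n; i++) {
        assert(affinity[i] >= 0.0 && affinity[i] <= 1.0);
        all_local *= affinity[i];
    }
    return all_local;
}
```

With field affinities 0.90 and 0.70, the first function gives approximately 0.97 (the TreeAdd entry in Fig. 4.5) and the second approximately 0.63 (the entry for t in Fig. 4.3).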

So far, we have only discussed computing update matrices intra-procedur- 
ally. A full inter-procedural implementation would need to be able to compute 
paths generated by the return values of functions, and handle control loops 
that span more than one procedure (for example, a mutual recursion). Our 
implementation performs only a limited amount of inter-procedural analysis. 
In particular, we do not consider return values, or analyze loops that span 
multiple procedures. This limited inter-procedural analysis is sufficient for all 
of our benchmarks; in the future, it may be possible to expand this analysis 
using techniques such as access path matrices [30]. 

4.3 The Heuristic 

Once the update matrices have been computed, the heuristic uses a two-pass 
process to select between computation migration and software caching. First, 
each control loop is considered in isolation. Then, in the second phase, we 
consider the interactions between nested control loops, and possibly decide 
to do additional caching. In addition to having the update matrix for each 
control loop, we also need information regarding whether or not the loop may 
be parallelized. In Olden, the compiler checks for the presence of programmer- 
inserted futures to determine when a control loop may be parallelized. 

In the first pass, for each control loop, we select the induction variable 
whose update has the strongest path affinity. If a control loop has no induc- 
tion variable, then it will select computation migration for the same variable 
as its parent (the smallest control loop enclosing this one) . If the path affinity 
of the selected variable’s update exceeds a certain threshold, or the control 
loop is parallelizable, then computation migration is chosen for this variable; 
otherwise, dereferences of this variable are cached. Dereferences of all other 
pointer variables are cached. We select computation migration for paralleliz- 
able loops with path affinities below the threshold because this mechanism 
allows us to generate new threads. (Due to Olden’s use of lazy rather than ea- 
ger futures, new threads are generated only following migrations and blocked 
touches.) 
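
The first pass can be sketched for a single control loop as follows. This is an illustration only: the names are hypothetical, the diagonal of the update matrix is passed as an array of percentages with `NO_UPDATE` for blank entries, the threshold is a parameter, and the case of a loop with no induction variable (which inherits its parent's selection) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_UPDATE (-1)   /* hypothetical marker for a blank diagonal entry */

typedef enum { USE_CACHING, USE_MIGRATION } mechanism;

/* Return the index of the induction variable whose update has the
 * strongest path affinity, or -1 if the loop has no induction variable. */
int pick_induction_variable(const int *diagonal, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (diagonal[i] == NO_UPDATE)
            continue;
        if (best < 0 || diagonal[i] > diagonal[best])
            best = (int)i;
    }
    return best;
}

/* Choose the mechanism for dereferences of the selected variable.
 * Dereferences of every other pointer variable are cached. */
mechanism choose_mechanism(int affinity, bool parallelizable, int threshold)
{
    assert(affinity >= 0 && affinity <= 100);
    if (affinity > threshold || parallelizable)
        return USE_MIGRATION;
    return USE_CACHING;
}
```

A parallelizable loop selects migration even when its affinity is below the threshold, which reflects the role of migration in generating new threads.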
    Traverse(tree *t) {
        if (t == NULL) return;
        else {
            Traverse(t->left);
            Traverse(t->right);
        }
    }

    WalkAndTraverse(list *l, tree *t) {
        for each body, b, in l do in parallel {
            Traverse(t);
        }
    }




Fig. 4.6. Example with a bottleneck. 



Considering control loops in isolation does not yield the best performance. 
Inside a parallel loop, it is possible to create a bottleneck by using com- 
putation migration. Consider the code fragments in Figures 4.6 and 4.7. 
WalkAndTraverse is a procedure that for each list item traverses the tree.* If 
computation migration were chosen for the tree traversal, the parallel threads 
for each item in the list would be forced to serialize on their accesses to the 
root of the tree, which becomes a bottleneck. In TraverseAndWalk, for each 
node in the tree, we walk the list stored at that node. Since there is a different 
list at each node of the tree, the parallel threads at different tree nodes are 
not forced to serialize, and there is no bottleneck. In general, a bottleneck 
occurs whenever the initial value of a variable selected for migration in an 
inner loop is the same over a large number of iterations of the outer loop. 
Returning to the examples, in WalkAndTraverse, t has the same value for 
each iteration of the parallel for loop, while in TraverseAndWalk, we as- 
sume t->list has a different value in each iteration (that is, at each node 
in the tree). Although in general this is a difficult aliasing problem, we do 
not need exact or conservative information. If incorrect information is used, 
the program will run correctly, but possibly more slowly than if more precise 
information were available. Our current approximation tests to see if the in- 
duction variable for the inner loop is updated in the parent loop. If so, we 
assume no bottleneck will occur; otherwise, we use caching in the inner loop 
to avoid the possibility of a bottleneck. Once the heuristic has analyzed the 
interactions between loops, the selection process is complete. 

* The syntax for specifying parallelism used in this example is not part of Olden 
and is used only to simplify the example. 

    Walk(list *l) {
        while (l) {
            visit(l);
            l = l->next;
        }
    }

    TraverseAndWalk(tree *t) {
        if (t == NULL) return;
        else {
            do in parallel {
                TraverseAndWalk(t->left);
                TraverseAndWalk(t->right);
            }
            Walk(t->list);
        }
    }




Fig. 4.7. Example without a bottleneck. 
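
The bottleneck test applied to Figures 4.6 and 4.7 reduces to one check per inner loop, sketched here in C. The names are hypothetical, and the flag `updated_in_parent` stands for the result of the approximation described in the text (whether the inner loop's selected variable is updated in the enclosing parallel loop).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { USE_CACHING, USE_MIGRATION } mechanism;

/* Second pass: revise the first-pass choice for a loop nested inside a
 * parallel loop. If the variable selected for migration is not updated
 * in the parent loop, every parallel thread would start from the same
 * node, so the inner loop falls back to caching. */
mechanism bottleneck_adjust(mechanism first_pass_choice, bool updated_in_parent)
{
    assert(first_pass_choice == USE_CACHING ||
           first_pass_choice == USE_MIGRATION);
    if (first_pass_choice == USE_MIGRATION && !updated_in_parent)
        return USE_CACHING;
    return first_pass_choice;
}
```

In WalkAndTraverse, t is not updated by the parallel loop, so the traversal is switched to caching; in TraverseAndWalk the list head is assumed to differ at each tree node, so the first-pass choice stands.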



4.3.1 Threshold and Default Path Affinities. The migration threshold 
has been set to 86% for the CM-5 implementation. Since the cost of a migra- 
tion is about seven times that of a cache miss on the CM-5, the break-even 
local path length is seven, which corresponds to a path affinity of 86%. For 
other platforms, the threshold would be $1 - 1/r$, where r is the ratio of the 
cost of a migration to the cost of a cache miss. On platforms where latency is 
greater, such as networks of workstations, r (and consequently the threshold) 
will be smaller, and on platforms where latency is smaller, such as the Cray 
T3D [19], r and the threshold will be larger. 
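
The relation between the cost ratio and the threshold can be checked with a one-line function. The function name is hypothetical; the formula is the one implied by the CM-5 figures above (a break-even local path length of seven corresponds to a path affinity of about 86%).

```c
#include <assert.h>

/* Migration threshold as a probability, where r is the ratio of the
 * cost of a migration to the cost of a cache miss (r >= 1). */
double migration_threshold(double r)
{
    assert(r >= 1.0);
    return 1.0 - 1.0 / r;
}
```

For r = 7 the result is about 0.857, which rounds to the 86% used for the CM-5; a larger r gives a larger threshold.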
We set the default path affinity to 70% (this corresponds to a default local 
path length of 3.33). This value was chosen so that, by default, list traversals 
will use caching, tree traversals will use computation migration, and tree 
searches will use caching. The averaging method for recursive calls was also 
designed to obtain this behavior. To see why this is desirable, recall the four 
processor tree allocation shown in Fig. 2.3. Generally, as in this example, 
we expect large subtrees to be distributed evenly among the processors. In 
such a case, searches of the tree (such as the code in Fig. 4.4), will traverse 
a relatively short path that may cross a processor boundary several times. A 
complete tree traversal (such as the code in Fig. 4.5), however, will perform 
large local computations. Consequently, it is preferable to use migration for 
the traversal, and caching for the search. List traversals may be viewed as 
searches through a degenerate tree. 

In our experience, these defaults provide the best performance most of the 
time. In those cases where the defaults are not appropriate, the programmer 
can specify local-path-length hints. (We do not allow the programmer to 
modify the threshold, but the same effect can be obtained by modifying the 
local path lengths.) We explicitly specified local path lengths in three of the 
eleven benchmarks (TSP, Perimeter, and MST) since the default affinity did 
not reflect the layout of the data structure, but in only one case (TSP) did 
it have a significant effect on performance. We examine TSP more closely in 
the next section. 

Returning to the declaration in Fig. 4.2 (page 723), we see why providing 
more precise local path length information is often unnecessary. Using default 
local path lengths rather than those from Fig. 4.2 changes the entry for t 
in the TreeAdd matrix (Figure 4.5) from 97 to 91, and the entry for t in 
the TreeSearch matrix (Figure 4.4) from 80 to 70. Since these changes do 
not cross the threshold, the heuristic will make the same selections. Our 
experience demonstrates that there is broad latitude for imprecision in the 
local-path-length information. 
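
The comparison above can be reproduced with integer percentages. This is a sketch with hypothetical function names; it assumes the CM-5 threshold of 86 and applies the recursion rule to TreeAdd and the join rule to TreeSearch.

```c
#include <assert.h>
#include <stdbool.h>

#define CM5_THRESHOLD 86   /* migration threshold in percent, from Sect. 4.3.1 */

/* Recursion rule for two updates, in integer percent. */
int recursion_percent(int a, int b)
{
    assert(a >= 0 && a <= 100 && b >= 0 && b <= 100);
    return 100 - ((100 - a) * (100 - b)) / 100;
}

/* Join rule for two branches, in integer percent. */
int join_percent(int a, int b)
{
    assert(a >= 0 && a <= 100 && b >= 0 && b <= 100);
    return (a + b) / 2;
}

/* True when the affinity alone is enough to select migration. */
bool above_threshold(int percent)
{
    return percent > CM5_THRESHOLD;
}
```

With the hinted affinities (90 and 70) the entries are 97 and 80; with the defaults (70 and 70) they are 91 and 70. Both TreeAdd values lie above the threshold and both TreeSearch values lie below it, so the selections do not change.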

4.3.2 Example. We now return to the TreeMultAdd example from Fig. 3.2. 
If we use the default local-path-length values, both the left and right fields 
have local path length 3.33. The path affinities of these fields are $1 - 1/3.33$, or 
70%. 

There is one control loop in this program, consisting of the two recursive 
calls. Both t and u have two updates from the recursive calls. The path 
affinities of both of these updates are 70% (as both the left and right fields 
have path affinity 70%). As shown in Fig. 4.8, using the rule for combining 
multiple updates via recursion, the entries for t and u in the update matrix 
will both be $1 - (1 - 0.7)(1 - 0.7)$, or 91%. Since t and u have entries on the 
diagonal of the matrix, they are both induction variables. 

When the heuristic examines this loop, it will select the induction variable 
whose update has the largest affinity for migration. Since the updates of both 
variables have the same affinity, one is arbitrarily chosen (we assume t). 







    Update matrix (rows: after; columns: before)

              t     u
        t    91
        u          91

Fig. 4.8. Update matrix for TreeMultAdd 




Input. Olden program with some local-path-length hints. 

Output. Program with each dereference marked for computation migration or 
software caching. 

Method. 

    comment Compute path affinities
    foreach structure declaration, D, do
        foreach pointer field, F, do
            if field F of structure D has no programmer-specified
               local path length then
                mark field F of structure D with default local path length
            Convert local path length to path affinity

    comment Compute update matrices
    foreach loop, L, do
        Compute an update matrix, M(L), for L

    comment Single-loop analysis
    foreach loop, L, do
        x <- induction variable with largest path affinity from M(L)
        foreach dereference, R, in L do
            if R dereferences x then
                if (affinity(M(L), x) > threshold) or (L is parallel) then
                    mark R for computation migration
                else
                    mark R for software caching
            else
                mark R for software caching

    comment Bottleneck analysis
    foreach parallel loop, P, do
        foreach loop, L, inside P, do
            x <- variable using computation migration in L
            if x is not updated in P then
                mark dereferences of x in L for software caching

Fig. 4.9. Selecting between computation migration and software caching 

References to u are then cached. Since there is only one loop, the bottleneck 
analysis will not be performed. 

4.3.3 Summary. In summary, Olden provides two mechanisms, computa- 
tion migration and software caching, for accessing remote data. Based on 
local-path-length information, the compiler automatically selects the appro- 
priate combination of these two mechanisms to minimize communication. 
The local path lengths may be obtained from the programmer, the Olden 
profiler (which will be presented in Sect. 6), or compiler defaults. The selec- 
tion process, as shown in Fig. 4.9, consists of four parts: converting local path 
lengths to probabilities, called path affinities; computing update matrices for 
each loop and recursion; analyzing loops in isolation; and using bottleneck 
analysis to analyze interactions between nested loops. We demonstrate the 
effectiveness of the heuristic on a suite of eleven benchmarks in the next 
section. 



5. Experimental Results 

In the previous sections, we have described Olden. In this section, we report 
results for eleven benchmarks using our Thinking Machines CM-5 implemen- 
tation. Each of these benchmarks is a C program annotated with futures, 
touches, calls to Olden’s allocation routine, and, in some cases, data-structure 
local-path-length information. 

We performed our experiments on CM-5s at two National Supercomput- 
ing Centers: NPAC at Syracuse University and NCSA at the University of 
Illinois. The timings reported are averages over three runs done in dedicated 
mode. For each benchmark, we present two one-processor versions. The se- 
quential version was compiled using our compiler, but without the overhead 
of futures, pointer testing, or our special stack discipline, which is described 
in Rogers et al. [47]. The one-processor Olden implementation (one) includes 
these overheads. The difference between the two implementations provides a 
measure of the overhead. 

Table 5.1 briefly describes each benchmark and Table 5.2 lists the run- 
ning time of a sequential implementation plus speedup numbers for up to 
32 processors for each benchmark. We report kernel times for most of the 
benchmarks to avoid having their data-structure building phases, which show 
excellent speedup, skew the results. Power, Barnes-Hut, and Health are the 
exceptions. We report whole program times (W) for Power and Barnes-Hut to 
allow for comparison with published results. We report whole program time 
for Health, because it does not have a data-structure build phase that would 
skew the results. We use a true sequential implementation compiled with our 
compiler for computing speedups. These sequential times are comparable to 
those using gcc with optimization turned off. Using an optimization level of 
two, the gcc code ranges from one (em3d, health and mst) to five (TSP) times 

Table 5.1. Benchmark Descriptions 

Benchmark    Description                                        Problem Size
TreeAdd      Adds the values in a tree                          1024K nodes
MST          Computes the minimum spanning tree of a            1K nodes
             graph [8]
Power        Solves the Power System Optimization               10,000 customers
             problem [39]
TSP          Computes an estimate of the best Hamiltonian       32K cities
             circuit for the Traveling-salesman problem [35]
Barnes-Hut   Solves the N-body problem using a hierarchical     8K bodies
             method [7]
Bisort       Sorts by creating two disjoint bitonic sequences   128K integers
             and then merging them [9]
EM3D         Simulates the propagation of electro-magnetic      40K nodes
             waves in a 3D object [20]
Health       Simulates the Colombian health-care system [38]    1365 villages
Perimeter    Computes the perimeter of a set of quad-tree       4K x 4K image
             encoded raster images [49]
Union        Computes the union of two quad-tree encoded        70,000 leaves
             raster images [51]
Voronoi      Computes the Voronoi Diagram of a set of           64K points
             points [24]


Table 5.2. Results 

               Heuristic  Seq. time   Speedup by number of processors
Benchmark      choice     (sec.)      1      2      4      8      16      32
TreeAdd        M            4.49      0.73   1.47   2.93   5.90   11.81   23.4
MST            M            9.81      0.96   1.36   2.20   3.43    4.56    5.14
Power (W)      M          286.59      0.96   1.94   3.81   6.92   14.85   27.5
TSP            M           43.35      0.95   1.92   3.70   6.70   10.08   15.8
Barnes-Hut (W) M+C        555.79      0.74   1.42   3.00   5.29    8.13   11.2
Bisort         M+C         31.41      0.73   1.35   2.29   3.52    4.92    6.33
EM3D           M+C          1.21      0.86   1.51   2.69   4.48    6.72   12.0
Health (W)     M+C         42.09      0.95   1.95   3.89   7.61   14.72   21.70
Perimeter      M+C          2.47      0.86   1.70   3.37   6.09    9.86   14.1
Union          M+C          1.46      0.76   1.48   2.88   5.11    7.97   11.30
Voronoi        M+C         49.73      0.75   1.38   2.41   4.23    6.88    8.76

(W) - Whole program times 

faster. The gcc optimizations can be traced to better register allocation and 
handling of floating point arguments, both of which could be implemented in 
Olden by adding an optimization phase to the compiler. 
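
The speedup figures in Table 5.2 can be turned into two derived quantities with the helpers below. Both are a sketch under stated assumptions: efficiency is taken in its usual sense of speedup divided by processor count, and the one-processor overhead is read off the one-processor column as the extra time of the Olden version relative to the sequential version. The function names are hypothetical.

```c
#include <assert.h>

/* Parallel efficiency: speedup divided by the number of processors. */
double parallel_efficiency(double speedup, int processors)
{
    assert(processors > 0);
    return speedup / (double)processors;
}

/* Fractional overhead of the one-processor Olden version over the
 * sequential version, given the one-processor speedup
 * (sequential time divided by one-processor Olden time). */
double one_processor_overhead(double one_processor_speedup)
{
    assert(one_processor_speedup > 0.0);
    return 1.0 / one_processor_speedup - 1.0;
}
```

For TreeAdd, a speedup of 23.4 on 32 processors corresponds to an efficiency of about 0.73, and a one-processor speedup of 0.73 corresponds to an overhead of about 37%.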



5.1 Comparison with Other Published Work 

Several of our benchmarks (Barnes-Hut, EM3D, and Power) come from the 
parallel computing literature and have CM-5 implementations available. In 
this section, we briefly describe these benchmarks and compare our results 
to those available in the literature. 

Barnes-Hut [7] simulates the motion of particles in space using a hierar- 
chical (O(n log n)) algorithm for computing the accelerations of the particles. 
The algorithm alternates between building an oct-tree that represents the 
particles in space and computing the accelerations of the particles. Falsafi et 
al. [21] give results for six different implementations of this benchmark; our 
results using their parameters (approximately 36 secs/iter) fall near the mid- 
dle of their range (from 15 to 80 secs/iter). In our implementation, however, 
the tree building phase is sequential and starts to represent a substantial frac- 
tion of the computation as the number of processors increases. We discuss 
this more later. 

EM3D models the propagation of electromagnetic waves through objects 
in three dimensions [20]. This problem is cast into a computation on an 
irregular bipartite graph containing nodes representing electric and magnetic 
field values (E nodes and H nodes, respectively). At each time step, new values 
for the E nodes are computed from a weighted sum of the neighboring H nodes, 
and then the same is done for the H nodes. For a 64-processor implementation 
with 320,000 nodes, our implementation performs comparably to the ghost 
node implementation of Culler et al. [20], yet does not require substantial 
modification to the sequential code. 

Power solves the Power-System-Optimization problem, which can be 
stated as follows: given a power network represented by a tree with the power 
plant at the root and the customers at the leaves, use local information to 
determine the prices that will optimize the benefit to the community [40]. 
It was implemented originally on the CM-5 by Lumetta et al. in a variant 
of Split-C. On 64 processors, Olden’s efficiency is about 80%, compared to 
Lumetta et al.’s 75%. 

5.2 Heuristic Results 

Our results indicate that the heuristic makes good selections, and only oc- 
casionally requires the programmer to specify local path lengths to obtain 
the best performance. For most of the benchmarks, the default local path 
lengths were accurate. For three of the eleven benchmarks (Perimeter, MST, 
and TSP) the defaults were not accurate. In the case of Perimeter and MST, 
specifying more accurate local path length information did not affect perfor- 
mance. Perimeter does a tree traversal and as we noted earlier the heuristic 
chooses migration for tree traversals by default. The local path length infor- 
mation that we specified did not contradict this choice. In the case of MST, 
the path length that we specified was for a data structure that was completely 
local. As a result, the benchmark’s performance did not depend on whether 
the heuristic chose migration or caching. 

In the case of TSP, changing the local path length information did af- 
fect performance. TSP is a divide-and-conquer algorithm with a non-trivial 
merge phase. The merge phase of TSP takes two Hamiltonian cycles and a 
single point, and combines them into a single cycle. Each merge is sequential 
and walks through the left sub-result followed by the right sub-result, which 
requires a migration for each participating processor. Given the default local 
path length, the heuristic will select caching for traversing the sub-results, 
as the procedure resembles a list traversal. This requires that all of the data 
be cached on a single processor. Because the sub-results have long local 
paths, using migration results in O(number of processors) migrations rather 
than O(number of cities) remote fetches from using caching. As the number 
of cities greatly exceeds the number of processors, much less communication 
is required using migration. By specifying this higher local path length value, 
we obtain a speedup of 15.8 on 32 processors, as opposed to 6.4 with the 
default value. 

The Voronoi Diagram benchmark is another case where the default values 
did not obtain the best performance; however, for this benchmark, chang- 
ing the values did not improve performance. Voronoi is another divide-and- 
conquer algorithm with a non-trivial merge phase. The merge phase walks 
along the convex hull of the two sub-diagrams, and adds edges to knit them to- 
gether to form the Voronoi Diagram for the whole set. Since the sub-diagrams 
have long local paths, walking along the convex hull of a single sub-result is 
best done with migration, but the merge phase walks along two sub-results, 
alternating between them in an irregular fashion. The ideal choice is to use 
migration to traverse one sub-result and cache the other (such a version has 
a speedup of over 12 on 32 processors). As we do not have good branch- 
prediction information, the heuristic instead chooses to pin the computation 
on the processor that owns the root of one of the sub-results and use software 
caching to bring remote sub-results to the computation. This version is not 
optimal, but nonetheless performs dramatically better than a version that 
uses only migration, which had a speedup of 0.47 on 32 processors. 

5.2.1 Other Performance Issues. Two benchmarks, MST and Barnes- 
Hut, while obtaining speedups, demonstrated possibilities for improvement. 
In both cases, the programs would benefit from less expensive synchro- 
nization mechanisms. For MST, time spent in synchronization caused poor 
performance. MST computes the minimum spanning tree of a graph using 
Bentley’s algorithm [8]. The performance for MST is poor and degrades 
sharply as the number of processors increases, because the number of mi- 
grations is O(NP), where N is the number of vertices, and P the number of 
processors. Caching would not reduce communication costs for this program, 
because these migrations serve mostly as a mechanism for synchronization. 

As noted earlier, Barnes-Hut uses an oct-tree that is rebuilt every iter- 
ation. In our implementation, the tree is built sequentially, because Olden 
lacks the synchronization mechanisms necessary to allow multiple updates to 
a data structure to occur in parallel. As the cost of computing the forces 
on the particles, which is done in parallel, decreases, the time spent on the 
tree-building phase becomes significant. 



5.3 Summary 

Overall, the Olden implementations provided good performance and required 
few programmer modifications. Where timings from other systems were avail- 
able (Power, Barnes-Hut, and EM3D), the Olden implementations performed 
comparably. In each case, the Olden implementation required less program- 
mer effort. These results indicate that Olden provides a simple yet efficient 
model for parallelizing programs using dynamic data structures. 



6. Profiling in Olden 

After getting a program to work correctly, the programmer’s attention usu- 
ally turns to making it run faster. The local path lengths in Olden provide 
a mechanism to allow the programmer to give hints to the system regarding 
the data layout. These hints allow the system to reduce communication by se- 
lecting the appropriate blend of computation migration and caching. As seen 
in TSP, changing these local path lengths may lead to dramatic changes in 
the speed of the algorithm. By having the system provide feedback regarding 
the choice of local path lengths, the programmer can avoid guesswork or a te- 
dious search of the parameter space in selecting the local-path-length values. 
Yet, even after obtaining correct local-path-length values, the program may 
still perform poorly. For example, a program with a poor data layout may 
have excessive communication overhead. Determining where the program is 
performing expensive operations, such as cache misses and migrations, and 
how often these occur may help the programmer improve the computation 
or the layout of the data. 

Olden provides profiling tools that allow the programmer to check the 
local path lengths of the fields of a data structure and also examine the 
number and type of communication events caused by each line in the program. 
The local path lengths of the fields of a structure may be computed at run 
time by calling an Olden procedure with a pointer to the root of the structure. 
When the appropriate compile-time flag is used, the compiler will generate 
code to record the number of communication events corresponding to each 
line of the program. In this section, we discuss how local path lengths are 
verified. Olden’s event profiler is relatively straightforward and is discussed 
in Carlisle’s thesis [14]. 

6.1 Verifying Local Path Lengths 

Recall from Sect. 4 that the local path length of a field of a structure is a hint 
that gives information about the expected number of nodes that will be made 
local by a migration following a traversal of a given field of the structure. In 
this section, we describe formally how to compute a local path length for a 
particular field of a data structure. 

To simplify our definitions, we assume that each leaf node in the graph has 
one child, a special node, $\phi$. The home of $\phi$ is unique (i.e., $\forall v \in V$, $home(\phi) \neq home(v)$), 
and $\phi$ has no descendants. We ignore all null pointers. We define a 
local path to be a sequence of vertices, $v_1, v_2, \ldots, v_n$, along a traversable set of 
pointers in the graph such that $home(v_i) = home(v_j)$ for the interior 
vertices ($2 \le i, j \le n-1$). 

We say a local path is maximal if $home(v_1) \neq home(v_2)$ and $home(v_{n-1}) \neq home(v_n)$. 
The length of a local path, $v_1, v_2, \ldots, v_n$, is $n-2$.* The lengths of 
the maximal local paths give an estimate of the benefit of using migration, 
as they provide an estimate of how many future references will be made local 
by performing a migration. We define the local path length of a field, F, of 
a data structure to be the average length of maximal local paths beginning 
with a pointer along field F. 




(Figure legend: marked arrows denote non-local pointers.) 



Fig. 6.1. A sample linked list 



* This differs from the normal notion of path length (see, for example, [43]). The 
length of a local path expresses the number of vertices in the path that have 
the same home. 

For an example of computing local path lengths, consider the data struc- 
ture in Fig. 6.1. For the next field of this list, there are two non-local pointers, 
$[v_4, v_5]$ and $[v_7, v_8]$. The corresponding maximal local paths are $[v_4, \ldots, v_8]$ 
and $[v_7, \ldots, v_{11}, \phi]$, which have lengths of 3 and 4. Ignoring the initial pointer 
from $v_0$ to $v_1$, the average maximal local path length is 3.5, and therefore the 
local path length for the next field is 3.5. If the traversal of the structure were 
to begin on a different processor than the owner of $v_1$, $[v_0, v_1, \ldots, v_5]$ would 
be another maximal local path (where $v_0$ is a dummy vertex corresponding to 
the variable holding the pointer to $v_1$), and the average maximal local path 
length would be $11/3$. For simplicity, we will assume for the remainder of this 
section that traversals begin on the owner of the root; however, the profiler 
considers initial pointers when computing local path lengths. 
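
For a singly linked list, the definition can be evaluated with one pass over the homes of the nodes. The sketch below is an illustration, not the profiler's code: the function name and the array representation are hypothetical, `start_home` is the processor on which the traversal begins, and the value 100 returned when no non-local pointer exists follows the convention the profiler is described as using later in this section.

```c
#include <assert.h>
#include <stddef.h>

/* home[i] is the processor that owns the i-th list node (0-based).
 * Returns the average length of the maximal local paths, where each
 * path starts at a non-local pointer and its length is the number of
 * consecutive nodes that share the home of the pointer's target. */
double list_local_path_length(const int *home, size_t n, int start_home)
{
    assert(home != NULL || n == 0);
    long total = 0;
    long paths = 0;
    int prev_home = start_home;
    size_t i = 0;
    while (i < n) {
        if (home[i] != prev_home) {
            /* Non-local pointer into node i: measure the local run. */
            size_t j = i;
            while (j < n && home[j] == home[i])
                j++;
            total += (long)(j - i);
            paths++;
            prev_home = home[i];
            i = j;
        } else {
            i++;   /* still on the processor where the traversal began */
        }
    }
    if (paths == 0)
        return 100.0;   /* no non-local pointers along this field */
    return (double)total / (double)paths;
}
```

A list of eleven nodes whose homes are 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2 reproduces the example when the traversal begins on processor 0: the two maximal local paths have lengths 3 and 4, and the function returns 3.5.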




Fig. 6.2. A sample tree with local-path-length information 



For structures that have more than one field, there are multiple maximal 
local paths corresponding to a single non-local pointer. In this case, we weight 
each maximal local path, P, by the total number of leaf-terminated paths that 
have P as a sub-path. This corresponds to assuming that all traversals are 
equally likely. Returning to the linked list example, each maximal local path is 
a sub-path of exactly one leaf-terminated path; therefore, we compute a sim- 
ple average of the maximal local path lengths. For an example of the weighted 
case, consider the tree in Fig. 6.2. There are three maximal local paths be- 
ginning with a right edge ($[v_3, v_7, v_{12}, \phi]$, $[v_4, v_9, v_{13}, \phi]$, and $[v_4, v_9, v_{14}, \phi]$), 
and their weighted average local path length is 2.0. For the left field, there 
are five maximal local paths, $[v_1, v_2, v_5, \phi]$, $[v_1, v_2, v_4, v_9]$, $[v_1, v_2, v_4, v_8, \phi]$, 
$[v_3, v_6, v_{10}, \phi]$, and $[v_3, v_6, v_{11}, \phi]$, and their weights are 1, 2, 1, 1, and 1, respec- 
tively (note there are two leaf-terminated paths from $v_9$, through $v_{13}$ and 
$v_{14}$). Consequently, their weighted average local path length is $13/6$. For this 
tree, the local path length of the right field is 2, and the local path length of 
the left field is $13/6$. 

We can compute these local path lengths in linear time with respect to 
the size of the graph using dynamic programming. For each node of the 
structure, using a simple depth-first traversal, we compute both the number 
of leaf-terminated paths from that node, and also the average local path 
length for paths beginning at that node. For a leaf node, the average local 
path length is zero, and the number of leaf-terminated paths is one. For any 
interior node, if it has already been visited (as might occur in a traversal of a 
DAG), we immediately return the computed values. Otherwise, we traverse 
all of its children. The number of paths from an interior node is then simply 
the sum of the number of paths from each of its children. The average local 
path length for an interior node, x, is given by: 

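$$\mathit{avg\_local\_path\_length}(x) = \sum_{c \in \mathit{children}(x)} \frac{\mathit{num\_paths}(c)}{\mathit{num\_paths}(x)} \cdot \mathit{same\_proc}(c, x) \cdot \bigl(1 + \mathit{avg\_local\_path\_length}(c)\bigr)$$

where $\mathit{same\_proc}(c, x)$ is 1 if c and x are on the same processor, and 0 
otherwise, and $\mathit{num\_paths}(v)$ is the number of leaf-terminated paths from v. 
Since we can compute the average local path length of a node, 
x, incrementally from the average local path lengths of its children using 
$O(\mathit{degree}(x))$ operations, we can perform the computation for all the nodes 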
in the graph in linear time with respect to the size of the graph. 
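
The dynamic-programming step can be sketched for a binary structure as follows. This is an assumption-laden illustration rather than the profiler's code: the `node` layout and the function name are hypothetical, and each child's contribution is weighted by its share of the leaf-terminated paths, which is our reading of the weighting described for structures with more than one field.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *left, *right;
    int home;          /* processor that owns this node */
    int visited;       /* nonzero once the two summaries below are valid */
    long num_paths;    /* leaf-terminated paths starting at this node */
    double avg_len;    /* average local path length from this node */
} node;

/* Depth-first computation of num_paths and avg_len. A node reached a
 * second time (as in a DAG) returns immediately with its stored values. */
void summarize(node *x)
{
    assert(x != NULL);
    if (x->visited)
        return;
    node *kids[2] = { x->left, x->right };
    if (kids[0] == NULL && kids[1] == NULL) {
        x->num_paths = 1;      /* a leaf starts exactly one path */
        x->avg_len = 0.0;
        x->visited = 1;
        return;
    }
    long paths = 0;
    double weighted = 0.0;
    for (int i = 0; i < 2; i++) {
        node *c = kids[i];
        if (c == NULL)
            continue;          /* null pointers are ignored */
        summarize(c);
        paths += c->num_paths;
        if (c->home == x->home)   /* same_proc(c, x) == 1 */
            weighted += (double)c->num_paths * (1.0 + c->avg_len);
    }
    x->num_paths = paths;
    x->avg_len = weighted / (double)paths;
    x->visited = 1;
}
```

Each node is processed once and touches only its own children, so the whole pass is linear in the size of the structure.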

During the same traversal, we can also compute the local path length 
for each field. If a non-local pointer is encountered during the traversal, we 
increment the total weight and weighted sum for that field. If a node, x, is 
visited a second time, we do not double count the non-local pointers that are 
in the subgraph rooted at x. Once the entire structure has been traversed, the 
local path length for the field is 100 if no non-local pointers are present along 
that field, or the computed value otherwise. Since the path affinity is given by 
1 - 1/(local path length), and we compute path affinities using integer arithmetic, 
all local path lengths greater than or equal to 100 will be converted to a path 
affinity of 99%. 
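As a worked illustration of the conversion, the following sketch assumes the affinity is 1 - 1/(local path length), which is consistent with a length of 100 mapping to 99%. The function name and the treatment of lengths of one or less are assumptions made for the example.

```python
def path_affinity_percent(local_path_length):
    """Path affinity as a whole percentage, assuming 1 - 1/length."""
    if local_path_length <= 1:
        return 0                             # assumption: clamp instead of going negative
    capped = min(local_path_length, 100)     # lengths >= 100 all map to 99%
    return int(100 - 100 / capped)           # truncate, as integer arithmetic would
```

Under these assumptions a local path length of 2 gives 50%, a length of 4 gives 75%, and any length of 100 or more gives 99%.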

The procedures used to compute local path lengths are generated auto- 
matically using preprocessor macros. To use the profiler to compute local 
path lengths for a tree, the programmer needs to add only two lines of code. 
In the header, the statement CHECK2(tree_t left, right, tree) directs 
the preprocessor to generate code to compute local path lengths for a data 
structure having two pointer fields, left and right, of type tree_t *. The 
last argument, tree, is used to generate a unique name. Once the data struc- 
ture has been built, a call to Docheck_tree(root) computes and prints the 
local path lengths for the two pointer fields. 

We used the profiler to compute local path lengths for each field in all of 
our benchmarks. For two of these benchmarks, TSP and Health, the heuristic 
will make different choices if the values computed by the profiler are used in 
place of the default local path lengths. Unfortunately, the results are mixed. 
For TSP we get a large performance gain by specifying a local path length 
other than the default; whereas for Health, we get a slight decrease in per- 
formance for large numbers of processors. 

The traversal of each sub-result of TSP has high locality, and thus we can 
reduce the communication substantially by using migration to traverse these 
sub-results. Using the profiler, we see that the run-time-computed local path 
length of the relevant field is 100, which is representative of this locality. By 
changing to the computed local path length value, we increase the speedup 
for TSP on 32 processors from 6.4 to 15.8. Thus, with the profiler, the 
programmer can discover this opportunity for increased performance without 
having to do a detailed analysis of the program. 

    while (p != patient) 
    { 
        p = list->patient; 
        list = list->forward; 
    } 

Fig. 6.3. Linked list traversal in Health 




Fig. 6.4. Example linked list 



The performance tradeoff for Health is more subtle. For this benchmark, 
each list is located almost entirely on a single processor. Because of this, the 
use of the computed local path length values will reduce the number of pointer 
tests that must be performed. Consider the code fragment in Fig. 6.3. Using 
computation migration, we can eliminate the pointer test for the reference 
to list->patient, as we know that list is local following the reference, 
list->forward. Using caching, we cannot eliminate the second test unless 
the programmer provides a guarantee that structures are aligned with cache- 
line boundaries. Changing the local path length value removes 6.8 million 
pointer tests and reduces the running time on one processor from 44.4 to 
42.2 seconds. For large numbers of processors, however, the communication 
cost is increased. Consider the list shown in Fig. 6.4. There is a single node 
on Processor 1, in the middle of a very long list on Processor 0. Because the 
list length is very large, the local path length will also be large. Nonetheless, 
using caching for this list traversal is preferable, because there is only one 
cache miss, as opposed to two migrations using computation migration. Since 
the lists in Health resemble the list in Fig. 6.4, the default version, which 
uses caching, performs slightly better for large numbers of processors. On 32 
processors, the running time is 1.94 seconds using the default version, and 
2.27 seconds using the run-time-computed local path lengths. 
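The tradeoff for a list like the one in Fig. 6.4 can be made concrete with a simplified cost count. This is only a sketch under stated assumptions: one list node per cache line, a cold cache, and a thread that does not need to return to its starting processor after the traversal. The function name is hypothetical.

```python
def traversal_costs(node_procs, home=0):
    """Count remote events for one traversal of a list.

    node_procs[i] is the processor owning the i-th list node; the thread
    starts on `home`. Returns (cache_misses, migrations).
    """
    # Caching: the thread stays on `home`, and every node stored elsewhere
    # is fetched once (assumes one node per line and a cold cache).
    cache_misses = sum(1 for p in node_procs if p != home)

    # Migration: the thread moves whenever the next node lives on a
    # different processor from the one it is currently running on.
    migrations = 0
    current = home
    for p in node_procs:
        if p != current:
            migrations += 1
            current = p
    return cache_misses, migrations


# One node on Processor 1 in the middle of a long list on Processor 0.
print(traversal_costs([0] * 50 + [1] + [0] * 50))   # (1, 2)
```

For a list stored entirely on a remote processor the counts reverse: every node is a cache miss, while a single migration suffices.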

7. Related Work 



Much work has been done on providing support for programming parallel 
machines. In this section, we describe how our work relates to that of other 
groups. First, we discuss some work by Rajiv Gupta that motivated our work 
on Olden. Then we divide other projects into three categories: object-oriented 
systems, parallel versions of C using fork and join, and other related projects 
not fitting in the first two categories. Figure 7.1 summarizes the differences 
between Olden and the other systems described in this section. 

Fig. 7.1. Summary of related work 



7.1 Gupta’s Work 

Our work on Olden was motivated originally by Rajiv Gupta’s work sup- 
porting dynamic data structures on distributed memory machines [25]. His 
approach was to create global names for the nodes in a data structure and 
then to apply a variant of run-time resolution [48] to the program. His naming 
scheme assigns a name to every node in a data structure as it is added to the 
structure and makes this name known to all processors (thereby producing 
a global name for the node). The name assigned to a node is determined by 
its position in the structure, as is the mapping of nodes to processors. For 
example, a breadth-first numbering of the nodes might be used as a naming 
scheme for a binary tree. Once a processor has a name for the nodes in a 
data structure, it can traverse the structure without further communication. 

This method of naming dynamic data structures leads to restrictions on 
how the data structures can be used. Since the name of a node is determined 
by its position, only one node can be added to a structure at a time (for 
example, two lists cannot be concatenated). Also, node names may have to 
be reassigned when a new node is introduced. For example, consider a list in 
which a node’s name is simply its position in the list. If a node is added to 
the front of the list, the rest of the list’s nodes will have to be renamed to 
reflect their change in position. 

Gupta’s decision to use a variant of run-time resolution to handle non- 
local references also has several problems, which are caused by the ownership 
tests that are done to allocate work. First, the overhead of these tests is 
high. Second, they prevent many important optimizations, such as vectorization 
and pipelining, from being applied. Run-time resolution was never 
intended to be a primary method of allocating work; instead, it was designed 
as the fallback position for compile-time resolution, which resolves who will 
do the work at compile time rather than at run time. In Olden, we take a 
much more dynamic approach to allocating work. We still test most refer- 
ences to determine where the work should be done, but each test is done by 
at most one processor and in general the other processors will be busy with 
their own work. 

7.2 Object-Oriented Systems 

Emerald [32] is an object-oriented language developed at the University of 
Washington for programming distributed systems. The language provides 
primitives to locate objects on processors explicitly, and also has constructs 
for moving objects. Each method is invoked on the processor owning that ob- 
ject, migrating the computation at the invocation. Arguments to the method 
may be moved to the invocation site on the basis of compile-time information. 

Amber [17] is a subset of C++ designed for parallel programming with a 
distribution model and mobility primitives derived from Emerald; however, 
all object motion is done explicitly. Amber also adds replication, but only for 
immutable objects. As in Olden, a computation may migrate within an invo- 
cation, but Amber computations only migrate if the objects they reference 
are moved out from under them explicitly by another thread. 

COOL [16] and Mercury [22] are also extensions of C++. Unlike Emerald 
and Amber, they are designed to run on a shared-memory multiprocessor. 
In COOL, by default, method invocations run on the processor that owns 
the object; however, this can be overridden by specifying affinities for the 
methods. There are three different types of affinities: object, processor, and 
task. If the programmer specifies that a method has affinity with a particular 
object (not the base object), then it will be run on the processor that owns 
that object. A processor affinity is similar, and could be viewed as an object 
affinity to a dummy object on the specified processor. Object affinities are 
used to increase locality; processor affinities are specified to balance the load. 
To increase cache reuse of an object, COOL allows the programmer to specify 
a task affinity with respect to that object. For each processor, the COOL run- 
time system has an array of task queues. There is a queue for each object 
owned by that processor, for each collection of tasks with task affinity, and 
for the collection of tasks with processor affinity for that processor. Load 
balancing is accomplished by work stealing. If a processor is idle, it may 
steal a task-affinity task queue from another processor, as these tasks are not 
required to run on a particular processor. Mercury also has a notion of object 
affinity, but uses templates rather than a scheduler. The template associated 
with each method specifies how to find the next task to execute. 

The Concert system [18,33] provides compiler and run-time support for 
efficient execution of fine-grained concurrent object-oriented programs. This 
work is primarily concerned with efficiency rather than language features. 
They combine static analysis, speculative compilation, and dynamic compi- 
lation. If an object type and the assurance of locality can be inferred from 
analysis, method invocations on that object can be in-lined, improving execu- 
tion efficiency. If imprecise information is available, it is sometimes possible to 
optimize for the likely cases, selecting amongst several specialized versions at 
run time. When neither of these is applicable, dynamic compilation is some- 
times used to optimize based on run-time information. The compiler will 
select which program points will be dynamically compiled. Concert provides 
a globally shared object space, common programming idioms (such as RPC 
and tail forwarding), inheritance, and some concurrency control. Objects are 
single threaded and communicate asynchronously through message passing 
(invocations). Concert also provides parallel collections of data, structures for 
parallel composition, first-class messages, and continuations. A major goal of 
the Concert project is to provide efficient support for fine-grain concurrency. 
The system seeks to accomplish this by automatically collecting invocations 
into groups using structure analysis [44]. 

Each of these systems supports more general synchronization than Olden; 
however, they do not provide the automatic selection between caching and 
computation migration. Emerald does have heuristics to move objects auto- 
matically, but does not support replication. Additionally, with the exception 
of Concert, these languages do not support in-lining; consequently, their use 
of objects adds additional overhead. 

7.3 Extensions of C with Fork-Join Parallelism 

Cid [42], a recently proposed extension to C, supports a threads and locks 
model of parallelism. Cid threads are lightweight and the thread creation 
mechanism allows the programmer to name a specific processing element on 
which the thread should be run. Unlike Olden, Cid threads cannot migrate 
once they have begun execution. This makes it awkward to take advantage 
of data locality while traversing a structure iteratively. Cid also provides a 
global object mechanism that is based on global pointers. The programmer 
explicitly requests access to a global object using one of several sharing modes 
(for example, readonly) and is given a pointer to a local copy in return. 
Cid’s global objects use implicit locking and the run-time system maintains 
consistency. 

One of Cid’s design goals is to use existing compilers. While this makes it 
easier to port to new systems, it does not allow it to take advantage of having 
access to the code generator and compile-time information. For example, once 
a Cid fork is in-lined, the system cannot change its mind. In Olden, an in- 
lined future may be stolen at a later time, should the processor become 
idle. Additionally, Olden programs more closely resemble their sequential C 
counterparts, as handling remote references is done implicitly. 

Cilk [10,11] is another parallel extension of C implemented using a pre- 
processor. The implementation introduces a substantial amount of overhead 
per call, as each call to a Cilk procedure requires a user-level spawn. Unlike 
Cid and Olden, these spawns are never in-lined. 

Cilk introduces dag-consistent shared memory, a novel technique to share 
data across processors. This is implemented as a stack, and shared mem- 
ory allocations are similar to local variable declarations. Caching is done 
implicitly at the page level, and work is placed on processors using a work- 
stealing scheduler. By contrast, Olden does caching at the line level, and is 
thus more efficient when small portions of a page are shared. Additionally, 
using computation migration, Olden can take advantage of data locality in 
placing the computation. Olden, however, gets its load balance from the data 
layout, and thus is only more efficient for programs where the data is placed 
so that the work is reasonably distributed among the processors. If the data 
is distributed non-uniformly, a Cilk program will perform better, as Cilk’s 
work-stealing algorithm will generate a better load balance. 

7.4 Other Related Work 

Split-C [20] is a parallel extension of C that provides a global address space 
and maintains a clear concept of locality by providing both local and global 
pointers. Split-C provides a variety of primitives to manipulate global pointers 
efficiently. In a related piece of work, Lumetta et al. [39] describe a global 
object space abstraction that provides a way to decouple the description of an 
algorithm from the description of an optimized layout of its data structures. 

Like Olden, Split-C is based on the modification of an existing compiler. 
Split-C, however, adopts a programming model where each processor has a 
single thread executing the same program. While this is perhaps convenient 
for expressing array-based parallelism, recursive programs must be written 
much more awkwardly. 

Prelude [28] is an explicitly parallel language that provides a computa- 
tion model based on threads and objects. Annotations are added to a Prelude 
program to specify which of several mechanisms — remote procedure call, ob- 
ject migration, or computation migration — should be used to implement an 
object or thread. Hsieh later implemented MCRL [29], which, like Olden, pro- 
vides both computation migration and data migration, and a mechanism that 
automatically selects between them. MCRL uses two heuristics to decide be- 
tween computation and data migration, static and repeat. The static heuristic 
may be computed at compile-time for most functions and migrates compu- 
tation for all non-local writes and data for all non-local reads. The repeat 
heuristic also always migrates computation for non-local writes, but makes a 
run-time decision for non-local reads based on the relative frequency of reads 
and writes. Computation migration is used for an object until two consecu- 
tive reads of that object have occurred without any intervening writes. The 
heuristics used by MCRL are very different from those in Olden, as MCRL 
is designed for a different type of program. In Olden, we are concerned with 
traversals of data structures, and improving locality by migrating the compu- 
tation if it will make future references local. The benchmarks we describe use 
threads that do not interfere. By contrast, MCRL is concerned with reducing 
coherence traffic for programs where the threads are performing synchronized 
accesses to the same regions with a large degree of interference. Olden cannot 
be used for such programs. Their heuristics for choosing between data and 
computation migration would perform poorly on our benchmarks, as they 
consider only the read/write patterns of accesses. On a read-only traversal of 
a large data structure on a remote processor, MCRL would choose to cache 
the accesses, as this causes no additional coherence traffic. Olden, however, 
would use migration, causing all references but the first to become local. 
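The two MCRL heuristics, as summarized above, can be sketched as small decision functions for non-local accesses. The names are hypothetical and this is not MCRL's implementation. In particular, the sketch assumes that the repeat heuristic switches to data migration on the read that follows two consecutive reads, which is one reading of the description.

```python
def static_choice(is_write):
    """Static heuristic: computation for non-local writes, data for reads."""
    return "computation" if is_write else "data"


class RepeatHeuristic:
    """Repeat heuristic: run-time choice for non-local reads, per object."""

    def __init__(self, reads_needed=2):
        self.reads_needed = reads_needed      # assumed threshold from the text
        self.reads_since_write = {}           # object -> consecutive reads seen

    def choose(self, obj, is_write):
        if is_write:
            # Writes always migrate the computation and reset the read run.
            self.reads_since_write[obj] = 0
            return "computation"
        seen = self.reads_since_write.get(obj, 0)
        self.reads_since_write[obj] = seen + 1
        # Keep migrating computation until enough uninterrupted reads occurred.
        return "data" if seen >= self.reads_needed else "computation"
```

Neither function looks at where future references will fall, which is why a read-only traversal of a remote structure ends up cached under both heuristics.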

Orca [5,6] also provides an explicitly parallel programming model based 
on threads and objects. Orca hides the distribution of the data from the 
programmer, but is designed to allow the compiler and run-time system to 
implement shared objects efficiently. The Orca compiler produces a summary 
of how shared objects are accessed that is used by its run-time system to 
decide if a shared object should be replicated, and if not, where it should 
be stored. Operations on replicated and local objects are processed locally; 
operations on remote objects are handled using a remote procedure call to 
the processor that owns the object. Orca performs a global broadcast at each 
fork. This restricts Orca to programs having coarse parallelism, and makes 
it awkward to express parallelism on recursive traversals of trees and DAGs. 
Also, Orca does not allow pointers, instead using a special graph type. This 
has the disadvantages of making it impossible to link together two different 
structures without copying one and also requiring that each dereference be 
done indirectly through a table. 

Carriero et al.’s [15] work on distributed data structures in Linda shares a 
common objective with Olden, namely, providing a mechanism for distributed 
processors to work on shared linked structures. In the details, however, the 
two approaches are quite different. The Linda model provides a global address 
space (tuple space), but no control over the actual assignment of data to 
processors. While Linda allows the flexibility of arbitrary distributed data 
structures, including arrays, graphs and sets, Olden’s control over data layout 
will provide greater efficiency for programs using hierarchical structures. 

Jegou [31] uses the idea of computation migration to provide an environ- 
ment for executing irregular codes on distributed memory multiprocessors. 
His system is implemented as an extension of FORTRAN. For each parallel 
loop, a single thread is started for each iteration. The data never moves; in- 
stead, when a thread needs non-local data, it is migrated to the processor 
that owns the data. The threads operate entirely asynchronously (i.e., no 
processor ever needs to wait for the result of a communication). This allows 
computation to be overlapped with communication. Because the threads mi- 
grate, detecting when the loop has terminated is non-trivial. Jegou provides 
an algorithm for detecting termination that requires O(p^2) messages. Using 
computation migration alone for these types of programs is successful because 
there is sufficient parallelism to hide the communication latency; however, for 
divide-and-conquer programs, we have demonstrated that combining compu- 
tation migration with software caching provides better performance. 



8. Conclusions 

We have presented a new approach for supporting programs that use pointer- 
based dynamic data structures on distributed-memory machines. In develop- 
ing our new approach, we have noted a fundamental problem with trying 
to apply run-time-resolution techniques, currently used to produce SPMD 
programs for array-based programs, to pointer-based programs. Array data 
structures are directly addressable. In contrast, dynamic data structures must 
be traversed to be addressable. This property of dynamic data structures pre- 
cludes the use of simple local tests for ownership, and therefore makes the 
run-time resolution model ineffective. 

Our solution avoids these fundamental problems by matching more closely 
the dynamic nature of the data structures. Rather than having a single thread 
on each processor, which decides if it should execute a statement by determin- 
ing if it owns the relevant piece of the data structure, we instead have multiple 
threads of computation, which are allowed to migrate to the processor own- 
ing the data they need. Along with this computation migration technique we 
provide a futurecall mechanism, which introduces parallelism by allowing 
processors to split threads, and a software caching mechanism, which pro- 
vides an efficient means for a thread to access data from multiple processors 
simultaneously. In addition, we have implemented a compiler heuristic, requiring 
minimal programmer input, that automatically chooses whether 
to use computation migration or software caching for each pointer dereference. 
We have also built a profiler, which will compute the local-path-length 
information used by the heuristic, and allow the programmer to determine 
which lines of the program cause communication events. 

We have implemented our execution mechanisms on the CM-5, and have 
performed experiments using this system on a suite of eleven benchmarks. 
Our results indicate that combining both computation migration and soft- 
ware caching produces better performance than either mechanism alone, and 
that our heuristic makes good selections with minimal additional information 
from the programmer. The Olden profiler can generate this information au- 
tomatically by examining a sample run of the program. Where comparisons 
were available, our system’s performance was comparable to implementations 
of the same benchmarks using other systems; however, the Olden implemen- 
tation was more easily programmed. 

1. Introduction 

There have been major efforts in developing programming language and com- 
piler support for distributed memory parallel machines. On these machines, 
large data arrays are typically partitioned among the local memories of in- 
dividual processors. Languages supporting such distributed arrays include 
Fortran D [11,19], Vienna Fortran [6,32], and High Performance Fortran 
(HPF) [21]. Many compilers for these HPF-like languages produce Single 
Program Multiple Data (SPMD) code, combining sequential code for op- 
erating on each processor’s data with calls to message-passing or runtime 
communication primitives for sharing data with other processors. 

Reducing communication costs is crucial to achieving good performance 
on applications [16, 18]. Current compiler prototypes, including the For- 
tran D [19] and Vienna Fortran compilation systems [6], apply message block- 
ing, collective communication, and message coalescing and aggregation to op- 
timize communication. However, these methods have been developed mostly 
in the context of regular problems, meaning for codes having only easily an- 
alyzable data access patterns. Special effort is required to develop compiler 
and runtime support for applications with more complex data access. 

In irregular problems, communication patterns depend on data values not 
known at compile time, typically because of some indirection in the array ref- 
erence patterns in the code. Indirection patterns have to be preprocessed, and 
the sets of elements to be sent and received by each processor precomputed, 
in order to reduce the volume of communication, reduce the number of mes- 
sages, and hide communication latencies through prefetching. An irregular 
loop with a single level of indirection can be parallelized by transforming 
it into two constructs, an inspector and an executor [24]. During program 
execution, the inspector running on each processor examines the data ref- 
erences made and determines which off-processor data must be fetched and 
where the data will be stored once received. The executor loop then uses the 
information from the inspector to implement the actual computation and 
communication, optimized to reduce communication costs wherever possible. 
Supporting the inspector-executor pair requires a robust runtime system. 
The Maryland CHAOS library [10] provides such support, including proce- 
dures for analyzing indices, generating communication schedules, translating 
global addresses to local addresses and prefetching data. 




Fig. 1.1. Simple Irregular Loop 



Figure 1.1 shows a loop with a single level of indirection. In this example, 
assume that all the arrays are aligned and distributed identically by blocks 
among the processors, and that the iterations of the i loop are similarly 
block partitioned. The resulting computation mapping is equivalent to that 
produced by the owner computes rule, a compiler heuristic that maps the 
computation of an assignment statement to the processor that owns the left 
hand side reference. Data array y is indexed using the array ia, causing a 
single level of indirection. Standard methods for compiling the loop for a 
distributed-memory machine generate a single inspector-executor pair. The 
inspector analyzes the subscripts, a gather-type communication occurs if off- 
processor data is read in the loop body, a single executor runs the original 
computation, and then a scatter-type communication occurs if off-processor 
data has been written in the loop body. 

The compiler produces SPMD code designed to run on each processor, 
with each processor checking its processor identifier to determine where its 
data and loop iterations fit into the global computation. Letting my$elems 
represent the number of local elements from each array (and the number of 
iterations from the original loop performed locally) the compiler generates 
the following code (omitting some details of communication and translation 
for clarity): 

      do i = 1, my$elems                  ! inspector 
         index$y(i) = ia(i) 
      enddo 

      ... fetch y elements to local memory, 
          modify index$y for local indices ... 

      do i = 1, my$elems                  ! executor 
         x(i) = y(index$y(i)) + z(index$z(i)) 
      enddo 

Because the references to y access only local elements of the distributed array 
ia, the inspector requires no communication. 
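The inspector-executor split for a single level of indirection can be sketched for one processor as follows. This is an illustration only: it uses 0-based global indices, assumes the processor owns the contiguous block lo..hi-1 of y, and models the gather as a plain list lookup. The function names `inspector` and `executor` are hypothetical and are not CHAOS routines.

```python
def inspector(ia_local, lo, hi):
    """Translate global subscripts into local ones and list what to fetch.

    Off-processor subscripts get slots in a buffer placed after the local
    block; duplicates share one slot so each element is fetched once.
    """
    n_local = hi - lo
    slot = {}                        # global index -> buffer slot
    index_y = []
    for g in ia_local:
        if lo <= g < hi:             # owned locally
            index_y.append(g - lo)
        else:                        # off-processor reference
            if g not in slot:
                slot[g] = n_local + len(slot)
            index_y.append(slot[g])
    fetch_list = list(slot)          # globals in buffer order
    return index_y, fetch_list


def executor(y_with_buffer, index_y):
    """Run the loop body x(i) = y(index_y(i)) on purely local storage."""
    return [y_with_buffer[k] for k in index_y]


# Usage: the processor owns y[4:8] of a global array of eight elements.
y_global = [10, 20, 30, 40, 50, 60, 70, 80]
index_y, fetch_list = inspector([5, 0, 7, 0, 2], lo=4, hi=8)
y_local = y_global[4:8] + [y_global[g] for g in fetch_list]   # stands in for the gather
x = executor(y_local, index_y)
```

The inspector output depends only on the indirection array, so it can be reused for every execution of the loop as long as that array is unchanged.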

Many application codes contain computations with more complex array 
access functions. Subscripted subscripts and subscripted guards can make the 
indexing of one distributed array depend on the values in another, so that 
a partial order is established on the distributed accesses. Loops with such 
multiple levels of indirection commonly appear in unstructured and adap- 
tive application codes associated with particle methods, molecular dynamics, 
sparse linear solvers, and in some unstructured mesh computational fluid 
dynamics solvers. 

In this paper we present various optimizations that are part of our runtime 
support system and methods for handling loops with complex indirection 
patterns, by transforming them into multiple loops each with a single level of 
indirection. We have implemented these methods in the Fortran D compiler 
developed at Rice University [17]. Our experiments demonstrate substantial 
performance improvements through message aggregation on a multiprocessor 
Intel iPSC/860 and through vectorization on a single Cray Y-MP processor. 

In Section 2 we present an overview of the runtime system we have developed 
and also discuss the various optimization techniques required 
to successfully compile irregular problems for distributed memory machines. 
Section 3 describes the compiling techniques and transformations that we 
have developed. In Section 4 we present results showing the benefits of the 
various runtime optimizations and also performance results from application 
codes compiled using the compilation techniques. We conclude in Section 5. 



2. Overview of the CHAOS Runtime System 

This section describes the CHAOS runtime support library, which is a su- 
perset of the PARTI [10] library, along with two new features, light-weight 
schedules and two-phase schedule generation, which are designed to handle 
adaptive irregular programs. 

The CHAOS runtime library has been developed to efficiently support 
parallelization of programs with irregular data access patterns. The library is 
designed to ease the implementation of computational problems on parallel 
architecture machines by relieving users of low-level machine specific issues. 

In static irregular programs, the executor is typically performed many 
times, while partitioning (data and work) and inspector generation are per- 
formed only once. In some adaptive programs, where data access patterns 
change periodically but reasonable load balance is maintained, the inspec- 
tor generation process must be repeated whenever the data access patterns 
change. In highly adaptive programs, the data arrays may need to be repar- 
titioned in order to maintain load balance. In such applications, all phases of 
the inspector are repeated upon repartitioning. 

The optimizations developed to handle static irregular problems have been 
presented in detail in earlier work [10]. We now present the optimizations 
required to handle adaptive problems - irregular problems where the data 
access patterns change every few loop iterations. In such cases, the inspector 
generation process has to be modified to minimize its cost, and we present 
two different ways of achieving that goal. 

Adaptive Schedule Generation 

A communication schedule is used to fetch off-processor elements into a local 
buffer before the computation phase, and to scatter these elements back to 
their home processors after the computational phase is completed. Commu- 
nication schedules determine the number of communication startups and the 
volume of communication. Therefore, it is important to optimize the schedule 
generation. 

The basic idea of the inspector-executor concept is to separate prepro- 
cessing of access patterns from communication and computation. Each piece 
of the transformed loop can then be optimized in the appropriate way. In- 
spector preprocessing can be combined with other preprocessing and hoisted 
out of as many loops as possible (those in which the data access information 
does not change), communication can be batched and vectorized, and the 
computation can be tuned in a tight inner loop. 

In adaptive codes where the data access pattern changes occasionally, the 
inspector is not a one-time preprocessing cost. Every time an indirection array 
changes, the schedules associated with the array must be regenerated. While 
this paper focuses on moving inspectors out of loops to reduce their frequency 
of execution, we have developed other techniques for efficient incremental 
inspection [10,26]. 

For example, in Figure 2.1, if the indirection array ic is modified, 
the schedules inc_sched_c and sched_ac must be regenerated. Generating 
inc_sched_c involves inspecting sched_ab to determine which off-processor 
elements are duplicated in that schedule. Communication schedule generators 
must therefore be efficient while maintaining the necessary flexibility. 
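The idea behind an incremental schedule can be sketched as a duplicate filter: only off-processor elements that an existing schedule does not already fetch need to be added. The function below is a hypothetical illustration, not a CHAOS routine, and assumes 0-based global indices with the processor owning the block lo..hi-1.

```python
def incremental_schedule(new_refs, already_scheduled, lo, hi):
    """Return off-processor indices in new_refs that no existing schedule fetches."""
    seen = set(already_scheduled)
    extra = []
    for g in new_refs:
        off_processor = not (lo <= g < hi)
        if off_processor and g not in seen:
            seen.add(g)              # also drops repeats within new_refs
            extra.append(g)
    return extra
```

With an existing schedule that fetches elements 9 and 10, new references [1, 9, 12, 9, 5] on a processor owning elements 4 to 7 require only elements 1 and 12 to be fetched.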

In CHAOS, the schedule-generation process on each processor is carried 
out in two distinct phases. 

— The index analysis phase examines the data access patterns to deter- 
mine which references are off-processor, removes duplicate off-processor 
references by storing information about distinct references in a hash table, 
assigns a local buffer for storing data from off-processor references, and 
translates global indices to local indices. 
L1: do n = 1, nsteps                                      ! outer loop
      call gather(y(begin_buff), y, sched_ab)             ! fetch off-proc data
      call zero_out_buffer(x(begin_buff), offp_x)         ! initialize buffer
L2:   do i = 1, local_sizeof_indir_arrays                 ! inner loop
        x(local_ia(i)) = x(local_ia(i))
                         + y(local_ia(i)) * y(local_ib(i))
      end do
S:    if (required) then
        modify part_ic(:)                                 ! ic is modified
        CHAOS_clear_mask(hashtable, stamp_c)              ! clear ic
        local_ic(:) = part_ic(:)
        stamp_c = CHAOS_enter_hash(local_ic)              ! insert new ic
        inc_sched_c = CHAOS_incremental_schedule(stamp_c)
        sched_ac = CHAOS_schedule(stamp_a, stamp_c)       ! sched for ia, ic
      endif
      call gather(y(begin_buff2), y, inc_sched_c)         ! incremental gather
      call zero_out_buffer(x(begin_buff2), offp_x2)       ! initialize buffer
L3:   do i = 1, local_sizeof_ic                           ! inner loop
        x(local_ic(i)) = x(local_ic(i)) + y(local_ic(i))
      end do
      call scatter_add(x(begin_buff), x, sched_ac)        ! scatter addition
    end do

Fig. 2.1. Schedule Generation for an Adaptive Program 



— The schedule generation phase generates communication schedules based 
on the information stored in the hash tables for each distinct off-processor 
reference. 

The communication schedule for processor p stores the following information: 

1. send list - a list of arrays that specifies the local elements on processor 
p required by all other processors, 

2. permutation list - an array that specifies the data placement order of 
off-processor elements in the local buffer of processor p, 

3. send size - an array that specifies the sizes of out-going messages from 
processor p to all other processors, and 

4. fetch size - an array that specifies the sizes of in-coming messages to 
processor p from all other processors. 
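
The four schedule components can be illustrated with a small sequential simulation. This is a minimal sketch and not the CHAOS implementation: the function name build_schedules, the dictionary layout, and the choice to fill buffer slots owner by owner are all assumptions made for illustration, and the requests are assumed to contain only distinct off-processor indices.

```python
def build_schedules(requests, owner, local_offset):
    """Simulate schedule generation for all processors at once.

    requests[p] lists the distinct off-processor global indices that
    processor p needs (assumption: duplicates were already removed).
    """
    nproc = len(requests)
    sched = [
        {
            "send_list": [[] for _ in range(nproc)],
            "permutation": [],
            "send_size": [0] * nproc,
            "fetch_size": [0] * nproc,
        }
        for _ in range(nproc)
    ]
    for p, wanted in enumerate(requests):
        slot = 0
        perm = [None] * len(wanted)
        # Buffer slots on p are filled owner by owner (an assumed layout).
        for q in range(nproc):
            for pos, g in enumerate(wanted):
                if owner(g) == q:
                    # q must send its local element local_offset(g) to p.
                    sched[q]["send_list"][p].append(local_offset(g))
                    perm[pos] = slot  # where request number pos lands on p
                    slot += 1
        sched[p]["permutation"] = perm
    for p in range(nproc):
        for q in range(nproc):
            size = len(sched[q]["send_list"][p])
            sched[q]["send_size"][p] = size   # q sends this many to p
            sched[p]["fetch_size"][q] = size  # p receives this many from q
    return sched


# Three processors, 12 elements in blocks of 4 (0-based global indices).
schedules = build_schedules(
    requests=[[9, 5, 10], [0], []],
    owner=lambda g: g // 4,
    local_offset=lambda g: g % 4,
)
```

In a real run each processor would hold only its own entry, and the send lists would be produced by exchanging the request lists between processors.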



The principal advantage of such a two-step process is that some of the 
index analysis can be reused in adaptive applications. In the index analysis 
phase, hash tables are used to store global to local translation and to remove
duplicate off-processor references. Each entry keeps the following information: 

1. global index - the global index hashed into the table, 

2. translated address - the processor and offset where the element is stored,

3. local index - the local buffer address assigned to hold a copy of the 
element, if it comes from off-processor, and 

4. stamp - an integer used to identify which indirection array inserted the 
element into the hash table. The same global index entry might be hashed 
into the table for many different indirection arrays; a bit in the stamp is 
marked for each indirection array. 

Stamps are useful for efficiently parallelizing adaptive irregular programs, 
especially for those programs with several index arrays where most of them 
do not change. In the index analysis phase, each index array inserted into the 
hash table is assigned a unique stamp that marks all its entries in the table. 
Communication schedules are generated based on the combination of stamps 
for an array reference. If any one of the index arrays changes, only the entries 
pertaining to the index array, namely those entries with the stamp marked 
for the index array, have to be removed from the hash table. Once the new 
index array entries are inserted into the hash table, a new schedule can be 
generated without rehashing the other index arrays. 
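
The stamp mechanism can be sketched as a table that maps each off-processor global index to a bit mask. This is a minimal illustration under stated assumptions, not the CHAOS data structure: the class and method names are hypothetical stand-ins for the library routines, address translation and buffer assignment are omitted, and the sample index arrays are invented so that the resulting schedules list the same elements as Figure 2.2.

```python
class StampedHashTable:
    """Off-processor references of one processor, tagged with stamps."""

    def __init__(self, owner, my_proc):
        self.owner = owner        # function: global index -> processor
        self.my_proc = my_proc
        self.entries = {}         # global index -> bit mask of stamps
        self.next_bit = 0

    def enter_hash(self, indices, stamp=None):
        """Insert an index array; allocate a new stamp unless one is reused."""
        if stamp is None:
            stamp = 1 << self.next_bit
            self.next_bit += 1
        for g in indices:
            if self.owner(g) != self.my_proc:   # keep off-processor refs only
                self.entries[g] = self.entries.get(g, 0) | stamp
        return stamp

    def clear_mask(self, stamp):
        """Forget one index array without touching the others."""
        for g in list(self.entries):
            self.entries[g] &= ~stamp
            if self.entries[g] == 0:
                del self.entries[g]

    def schedule(self, *stamps):
        """Elements referenced by any of the given index arrays."""
        mask = 0
        for s in stamps:
            mask |= s
        return [g for g, s in self.entries.items() if s & mask]

    def incremental_schedule(self, stamp, base):
        """Elements needed for stamp that the base schedules do not fetch."""
        return [g for g, s in self.entries.items() if s & stamp and not s & base]


# Processor 0 owns global indices 1..5; the sample arrays are hypothetical.
table = StampedHashTable(owner=lambda g: 0 if g <= 5 else 1, my_proc=0)
a = table.enter_hash([1, 7, 9, 3])
b = table.enter_hash([7, 8, 2])
c = table.enter_hash([9, 8, 10])
sched_c = table.schedule(c)
inc_sched_c = table.incremental_schedule(c, base=a | b)

# ic changes: only the entries stamped c are cleared and rehashed.
table.clear_mask(c)
table.enter_hash([7, 6], stamp=c)
new_inc_sched_c = table.incremental_schedule(c, base=a | b)
```

The entries for ia and ib survive the update untouched, which is the saving the two-phase design is meant to provide.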

Figure 2.1 illustrates (in pseudo-code) how CHAOS routines are used to 
parallelize an adaptive problem. The conditional statement S may cause the 
indirection array ic to be modified. Whenever this occurs, the communication 
schedules that involve prefetching references of ic must be modified. Since 
the values of ic in the hash table are no longer valid, the entries with stamp
stamp_c are cleared by calling CHAOS_clear_mask(). New values of ic are
then inserted into the hash table by the call to CHAOS_enter_hash(). After
all indirection arrays have been hashed, communication schedules can be
built for any combination of indirection arrays by calling CHAOS_schedule()
or CHAOS_incremental_schedule() with an appropriate combination of
stamps.

An example of schedule generation for two processors with sample val- 
ues for indirection arrays ia, ib, and ic is shown in Figure 2.2. The global 
references from indirection array ia are stored in hash table H with stamp 
a, ib with stamp b and ic with stamp c. The indirection arrays might have 
some common references. Hence, a hashed global reference might have more 
than one stamp. The gather schedule sched_ab for the loop L2 in Figure 2.1
is built using the union of references with time stamps a or b. The scatter
operation for loop L2 can be combined with the scatter operation for the loop
L3. The gather schedule inc_sched_c for loop L3 is built with those references
that have time stamp c only, because references that also have time stamps
a or b have been fetched using the schedule sched_ab. The scatter schedule
for loops L2 and L3 is built using the union of references with time stamps
a and c.
[Figure 2.2 panels: "Initial distribution of data arrays" and "Generating
communication schedules from the hash table"; only the schedule summary of
the second panel is recoverable.]

sched_a     = schedule(stamp = a)                          will gather/scatter 7, 9
sched_b     = schedule(stamp = b)                          will gather/scatter 7, 8
inc_sched_b = incremental_schedule(base = a, stamp = b)    will gather/scatter 8
sched_c     = schedule(stamp = c)                          will gather/scatter 9, 8, 10
inc_sched_c = incremental_schedule(base = a,b, stamp = c)  will gather/scatter 10



Fig. 2.2. Schedule Generation on One Processor, Using a Hash Table 



PARTI, the runtime library that preceded CHAOS, also had support for 
building incremental and merged schedules [10]. However, in PARTI such 
schedules were built using functions specialized for these purposes. The 
CHAOS library restructures the schedule generation process by using
a global hash table that provides a uniform interface for building all types of 
schedules. Such a uniform interface is easier to use for both parallel applica- 
tion developers and for compilers that automatically embed CHAOS schedule 
generation calls. 

Light-Weight Schedules 

In certain highly adaptive problems, such as those using particle-in-cell
methods, data elements are frequently moved from one set of elements to another
during the course of the computation. The implication of such adaptivity 
is that preprocessing for a loop must be repeated whenever the data access 
pattern of the loop changes. In other words, previously built communication 
schedules cannot be reused and must be rebuilt frequently. 

In such applications a significant optimization in schedule generation can 
be achieved by recognizing that the semantics of set operations imply that set 
elements can be stored in any order. This information can be used to build 
light-weight communication schedules that cost less to build than normal
schedules. During schedule-generation, processors do not have to exchange 
the addresses of all the elements they will be accessing from other processors; 
they only need to exchange information about the number of elements they 
will be appending to each set of elements. This optimization greatly reduces 
the communication costs for schedule generation. A light-weight schedule for 
processor p stores the following information: 

1. send list - a list of arrays that specifies the local elements on processor 
p required by all other processors, 

2. send size - an array that specifies the out-going message sizes from pro- 
cessor p to all other processors, and 

3. fetch size - an array that specifies the in-coming message sizes on pro- 
cessor p from all other processors. 

Thus, light-weight schedules are similar to the previously described sched- 
ules, except that they do not carry information about data placement order 
on the receiving processor. While the cost of building a light-weight schedule 
is less than that of regular schedules, a light-weight schedule still provides the 
communication optimizations of aggregating and vectorizing messages [10]. 
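
A light-weight schedule can be sketched as follows. This is an illustrative simulation under assumptions, not the CHAOS interface: the function names are hypothetical, all processors are simulated in one address space, and elements that stay on their processor are treated as a message to self. The point is that only the message sizes cross processor boundaries during schedule generation, and that received elements are simply appended.

```python
def light_weight_schedule(dest_of_local):
    """dest_of_local[p][k] is the destination processor of p's k-th element."""
    nproc = len(dest_of_local)
    # Each processor derives its send lists from purely local information.
    send_list = [
        [[k for k, d in enumerate(dests) if d == q] for q in range(nproc)]
        for dests in dest_of_local
    ]
    send_size = [[len(send_list[p][q]) for q in range(nproc)]
                 for p in range(nproc)]
    # The only exchange: each processor learns how much it will receive.
    fetch_size = [[send_size[q][p] for q in range(nproc)]
                  for p in range(nproc)]
    return send_list, send_size, fetch_size


def move_elements(data, send_list):
    """Executor: append incoming elements in arrival order (no permutation)."""
    nproc = len(data)
    moved = [[] for _ in range(nproc)]
    for p in range(nproc):
        for q in range(nproc):
            moved[q].extend(data[p][k] for k in send_list[p][q])
    return moved


particles = [["a", "b", "c"], ["d", "e"]]
send_list, send_size, fetch_size = light_weight_schedule([[0, 1, 1], [0, 1]])
moved = move_elements(particles, send_list)
```

Because the receiver never needs to know where each element will be placed, no permutation list is built, which is what distinguishes this from a regular schedule.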

While the routines in the CHAOS library, including light-weight sched- 
ules and two-phase schedule generation, can be used directly by application 
programmers, the library can also be used as runtime support for compilers. 
Previous work has concentrated on using the routines to effectively compile 
irregular applications with a single level of indirection [14,26,31]. We now 
present a solution to the problem of extending that work to compiling more 
complex applications with multiple levels of indirection. 



3. Compiler Transformations 

Standard methods for compiling regular accesses to distributed arrays gen- 
erate a single inspector-executor pair. The inspector analyzes the subscripts, 
a gather-type communication occurs if off-processor data is read, a single ex- 
ecutor runs the original computation, and then a scatter-type communication 
occurs if off-processor data has been written. 

This approach suffices for irregular access if the inspector can determine 
the array indexing pattern without communication. Consider the loop in 
Figure 1.1, but now assume that, while all the other arrays and the loop iter- 
ations are block-distributed so that x(i), ia(i), and y(i) are all assigned to the 
processor that executes iteration i, array z has been cyclic-distributed so that
z(i) usually lies on a different processor. In addition to the potential irregular 
off-processor references to y, we now have regular off-processor references to 
z. 

The compiler will produce SPMD code designed to run on each proces- 
sor, checking the processor identifier to determine where its local data and 
assigned iterations fit into the global computation. Letting my$elems
represent the number of local elements from each array (and the number of
iterations from the original loop performed locally) and my$offset represent
the amount that must be added to the local iteration numbers and array
indices to obtain the global equivalents, we get the following code (omitting
some details of communication and translation for clarity):

      do i = 1, my$elems                       ! inspector
        index$z(i) = i + my$offset
        index$y(i) = ia(i)
      enddo

      ... fetch y and z elements to local memory,
          modify index$z and index$y for local indices ...

      do i = 1, my$elems                       ! executor
        x(i) = y(index$y(i)) + z(index$z(i))
      enddo

Because the array references for y access only local elements of the distributed
array ia, the inspector for y requires no communication and can be combined
with the inspector for z.
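
The single inspector-executor pair above can be mimicked in a few lines. This is a sketch under assumptions: indices are 0-based, the number of processors is assumed to divide the array length, and the fetch step is modeled by copying from global lists rather than by real communication. The helper name spmd_step is hypothetical.

```python
def spmd_step(p, nproc, y, z, ia):
    """Compute processor p's block of x(i) = y(ia(i)) + z(i)."""
    my_elems = len(ia) // nproc      # assumes nproc divides the length
    my_offset = p * my_elems

    # Inspector: record the global indices each iteration will touch.
    index_z = [i + my_offset for i in range(my_elems)]
    index_y = [ia[i + my_offset] for i in range(my_elems)]

    # Fetch and localize: copy the elements into local buffers and replace
    # the global indices with positions in those buffers.
    buf_y = [y[g] for g in index_y]
    buf_z = [z[g] for g in index_z]
    index_y = list(range(my_elems))
    index_z = list(range(my_elems))

    # Executor: the original computation, now on local data only.
    return [buf_y[index_y[i]] + buf_z[index_z[i]] for i in range(my_elems)]


y = [10, 20, 30, 40, 50, 60, 70, 80]
z = [1, 2, 3, 4, 5, 6, 7, 8]
ia = [7, 0, 3, 3, 1, 6, 2, 5]
x = spmd_step(0, 2, y, z, ia) + spmd_step(1, 2, y, z, ia)
```

The inspector here needs no communication only because ia(i) is local to the processor that executes iteration i, which is the condition the text states.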

However, if an inspector needs to make a potentially non-local reference
(either because of a misaligned reference or from multiple levels of
indirection), the single inspector-executor scheme breaks down. The inspector
must itself be split into an inspector-executor pair. Given a chain of n
distributed array references, each reference in the chain depending on the
previous reference, we must produce n + 1 loops: one initial inspector, n - 1
inspectors that also serve as the executor for the previous inspector, and one
executor to produce the final result(s). We can generate these loops in the
proper order with the aid of the slice graph representation we will define in
Section 3.2: each slice represents a loop that must be generated, and the edges
between slices represent the dependences among them.

Figure 3.1 shows three ways that a reference to a distributed array x 
can depend on another distributed array reference ia, and that reference 
may depend on yet another distributed array reference. In the figure, the 
dependence of indirection array ia on another array may result from the 
use of subscripted values in subscripts, as in example A; from conditional 
branches, as in example B; or from loop bounds, as in example C. These 
three types of dependences may be combined in various ways in a program 
to produce arbitrary chains of dependences. 

      A:  do i = 1, n
            x(ia(ib(i))) = ...
          enddo

      B:  do i = 1, n
            if (ic(i)) then
              x(ia(i)) = ...
            endif
          enddo

      C:  do i = 1, n
            do j = ia(i), ia(i+1)
              x(ia(j)) = ...
            enddo
          enddo

Fig. 3.1. Dependence Between Distributed Array Accesses.



3.1 Transformation Example

The loop in Figure 3.2 represents a kernel from a molecular dynamics
program. The first three statements specify the data and work distribution. The
data arrays (x and y) are distributed by blocks, while the integer arrays (ia
and ib) are irregularly distributed using a map-array [11] to specify the
placement of each array element. The ON_HOME clause [12] indicates that loop
iteration i will be mapped to the same processor as x(i) (i.e., the iterations
will also be block-partitioned). In this loop, there are array accesses with
inspection level 1 (ia), 2 (ib) and 3 (y).



D0    DISTRIBUTE x and y by BLOCK
D1    DISTRIBUTE ia and ib using a map-array
D2    EXECUTE (i) ON_HOME x(i)
L0    DO i = 1, n
S0      if (ia(i)) then
S1        x(i) = x(i) + y(ib(i))
S2      end if
S3    end do



Fig. 3.2. Example Loop 



The code generated for Figure 3.2 requires a chain of several inspectors 
and executors. An executor obtains the values for ia. The values in ia are 
used to indicate which subscript values of ib are actually used. The executor 
obtaining values of ib provides the subscripts for y that are actually used. 
Finally, an executor is required for obtaining the values of y. Each inspector- 
executor step, with intervening communication and localization, handles one 
level of non-local access. Frequently, as in this example, the executor of one 
step can be combined with the inspector for the next. For the example, exactly 
three inspectors are required to obtain the required non-local elements of y. 

The transformation required to eliminate non-local array references in 
loops can be divided into two distinct parts. 
— The first part of the transformation process breaks a loop whose references
have multiple inspection levels into multiple loops whose references have 
no more than one inspection level. Each non-local reference becomes a 
distributed array reference indexed by a local index array. Figure 3.3 shows 
this intermediate stage in transforming the example loop of Figure 3.2. 
When it is possible, the executor producing the values from one array is 
merged with the inspector for the subscripts where those values are used. 

— The second part of the transformation completes the inspector-executor 
pairs for each of the loops or code segments generated. For the completion 
of the inspector-executor pairs we insert runtime library calls for collective 
communication and for global to local address translation before each of 
the executor loops (TL2, TL3 and TL4) shown in Figure 3.3 [14,26,31]. 

The transformation of the example code shown in Figure 3.2 results in the
code shown in Figure 3.3. The first loop TL1 in Figure 3.3 is created to obtain
the global iteration numbers for the i-loop and the result is stored in local
array index$ia. This is part of the inspector for the access to ia, determining
the global indices of the accessed elements.

Loop TL2 is a typical loop with data accessed through a single level of
indirection ia(index$ia(i)). The distributed array ia is accessed by an
indirection array that contains global indices. After execution of this loop the
local array index$arr contains the global iteration numbers that take the true
branch of the if-condition. The variable index$cntr contains the number of
times the true branch of the if-condition is executed. A communication and
localization phase will later be inserted between TL1 and TL2 to complete
the inspector-executor pair for ia's single level of indirection. Among other
things, that phase will convert index$ia from global indices to local indices,
including indices for local copies of off-processor elements.

The next loop, TL3, is a loop in which data is accessed through a single
level of indirection. The loop bounds have been changed so that the loop is
executed the number of times the true branch of the if-condition is taken
(index$cntr). The values stored in index$y are the global indices of the array
y that are accessed in statement S1 in Figure 3.2. Again a communication
and localization step will later be inserted between TL2 and TL3 to complete
the inspector-executor pair for ib.

The loop TL4 becomes the executor for the original loop in Figure 3.2. 
Both distributed arrays x and y are accessed by local indirection arrays. The 
inspector-executor pair for the original loop is now complete. 

The example shows that, given a general irregular loop with multiple 
levels of indirection, we can reduce it to a sequence of loops in which the 
distributed arrays are only indexed by local arrays. In the original loop, if 
there are distributed array references in control statements (such as the test
on ia(i) in the example),
they can also be transformed to conditionals that are only dependent on local 
array values. 
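
The four-loop structure of the transformed example can be checked against the original loop with a small simulation. This is a sketch under assumptions: indices are 0-based, the communication and localization phases are modeled by a hypothetical fetch() helper that copies from global lists, and x is indexed globally for brevity even though the generated code would use local indices.

```python
def fetch(array, global_indices):
    """Stand-in for communication plus localization (an assumption)."""
    buffer = [array[g] for g in global_indices]
    return buffer, list(range(len(buffer)))


def transformed_loop(x, y, ia, ib, my_offset, my_elems):
    # TL1: global indices of ia accessed on local iterations.
    index_ia = [i + my_offset for i in range(my_elems)]
    ia_buf, index_ia = fetch(ia, index_ia)

    # TL2: executor for ia, inspector for ib; keep iterations whose test holds.
    index_arr = [i + my_offset for i in range(my_elems) if ia_buf[index_ia[i]]]
    index_cntr = len(index_arr)
    ib_buf, index_ib = fetch(ib, index_arr)

    # TL3: executor for ib, inspector for y.
    index_y = [ib_buf[index_ib[i]] for i in range(index_cntr)]
    y_buf, index_y = fetch(y, index_y)

    # TL4: the actual loop computation.
    for i in range(index_cntr):
        x[index_arr[i]] += y_buf[index_y[i]]


n, nproc = 8, 2
ia = [1, 0, 1, 1, 0, 1, 0, 1]
ib = [7, 0, 3, 3, 1, 0, 2, 6]
y = [10, 20, 30, 40, 50, 60, 70, 80]

x_transformed = [0] * n
for p in range(nproc):
    transformed_loop(x_transformed, y, ia, ib, p * (n // nproc), n // nproc)

x_original = [0] * n
for i in range(n):
    if ia[i]:
        x_original[i] += y[ib[i]]
```

Each fetch() call marks a point where the second part of the transformation would insert collective communication and global-to-local translation.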
n$proc    = number of processors
my$proc   = my processor number [0 .. (n$proc - 1)]
my$elems  = number of elements on each processor [n/n$proc]
my$offset = global offset of first element on each processor [my$elems * my$proc]

      ! Get Indices of Global ia Accessed on Local Iterations
TL1   DO i = 1, my$elems
        index$ia(i) = i + my$offset
      end do

      ! Get Indices of Global ib Accessed on Local Iterations
TL2   index$cntr = 0
      DO i = 1, my$elems
        if (ia(index$ia(i))) then
          index$cntr = index$cntr + 1
          index$arr(index$cntr) = i + my$offset
        end if
      end do

      ! Get Indices of Global y Accessed on Local Iterations
TL3   DO i = 1, index$cntr
        index$y(i) = ib(index$arr(i))
      end do

      ! The Actual Loop Computation
TL4   DO i = 1, index$cntr
        x(index$arr(i)) = x(index$arr(i)) + y(index$y(i))
      end do



Fig. 3.3. Transformation of Example Loop. 
In this paper, we present the algorithms required to perform the first part
of the transformation, namely breaking a loop requiring multiple inspection 
levels into multiple loops requiring no more than one inspection level. The 
second part of the transformation, insertion of collective communication, is 
discussed elsewhere [13]. We have shown the generation of the intermediate 
code using the loop depicted in Figure 3.2. While for many applications the 
index arrays are much larger than the data arrays, for simplicity of presen- 
tation we assume that all distributed arrays are the same size. 

3.2 Definitions 

This section introduces some concepts that will be useful for describing the 
transformation algorithms presented in Section 3.3. 

3.2.1 Program Representations. We represent programs to which the 
transformations will be applied as abstract syntax trees (ASTs), which en- 
code the source-level program structure. In addition, we use three standard 
auxiliary representations: a control-flow graph (CFG) [1], a data-flow graph 
(DFG) represented in static single-assignment form [7], and control depen- 
dences. 

We label each relevant expression and variable reference with a value 
number. Two nodes with the same value number are guaranteed to have the 
same value during program execution [2,15]. 
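
As a rough illustration of value numbering, the sketch below assigns equal numbers to structurally identical expression trees. It is a simplification and not the method of [2,15]: expressions are nested tuples, variable names are assumed to be single-assignment (so equal names mean equal values), and the class name ValueNumbering is hypothetical.

```python
class ValueNumbering:
    """Hash-based value numbering over expression trees."""

    def __init__(self):
        self.table = {}

    def number(self, expr):
        if isinstance(expr, tuple):
            # Interior node: operator plus the value numbers of its operands.
            op, *args = expr
            key = (op, *[self.number(a) for a in args])
        else:
            # Leaf: a constant or a single-assignment variable name.
            key = ("leaf", expr)
        return self.table.setdefault(key, len(self.table))


vn = ValueNumbering()
first = vn.number(("index", "ib", ("+", "i", 1)))    # ib(i + 1)
second = vn.number(("index", "ib", ("+", "i", 1)))   # the same subscript again
other = vn.number(("index", "ib", "i"))              # ib(i)
```

Subscripts that receive the same number can share one precomputed trace, which is how the value number is used later in the transplant descriptor.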

3.2.2 Transplant Descriptor. In flattening multi-level index expressions, 
we extract complicated subscript computations, replicate them outside their 
original loops, and save the sequence of subscripts in a local index array. The 
subscript computations must be copied, transplanted back into the original 
program and the values saved, all without disturbing other computations. 

The transplant descriptor is a data structure that contains information
to replicate the value of an expression at another point in the program. The
replicated computation is built using program slicing techniques. In the
literature, a slice is a subset of the statements in a program that is determined
to affect the value of a variable at a particular point in a program [30]. Our
method builds each slice as a derivative program fragment, tailored to
compute the values of interest and ready to be transplanted elsewhere in the
program.

A transplant descriptor t for a program P is a composite object

      t = (v_t, c_t, t_t, i_t, d_t [, n_t])

containing

v_t: a value number for the value of interest, used primarily to avoid inserting
     redundant code for equivalent computations
c_t: the slice, a sequence of statements from the AST of P required to
     recompute the desired sequence of values, represented by node indices in
     the AST (c_t resembles a dynamic backward executable slice [29])
t_t: the target, a location in P where c_t can safely be placed without changing
     the meaning of P
i_t: the identifier for a local index array which, after execution of c_t, will
     contain the index values of the subscript whose value number is v_t
d_t: a set of AST node indices for subscripts on which the slice c_t depends,
     that will also need preprocessing and transplanting
n_t: if we compute counting transplant descriptors, this is the value number
     for the size of the subscript trace stored in i_t

We construct two varieties of transplant descriptors: 

— A collecting transplant descriptor saves in i_t the sequence of values (the
trace) that a subscript assumes during program execution.

— A counting transplant descriptor calculates the size of the subscript trace 
that will be generated during the execution of a particular collecting trans- 
plant descriptor. A counting transplant descriptor is needed if generating 
a collecting transplant descriptor will require the size of the trace to be 
recorded, for example for preallocating a data structure to store the trace. 
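
The two varieties can be pictured with the subscript ib(i) of the example loop in Figure 3.2, which is guarded by the test on ia(i). This is a sketch only: the record type, its field names, and the two slice functions are hypothetical, and the arrays are ordinary Python lists rather than distributed arrays.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TransplantDescriptor:
    """Field names are illustrative; they mirror v, c, t, i, d and n."""
    value_number: int
    slice_nodes: list
    target: str
    index_array: str
    depends_on: set = field(default_factory=set)
    count_value_number: Optional[int] = None


def collecting_slice(ia, ib):
    """Record the trace of the subscript ib(i) over the whole loop."""
    trace = []
    for i in range(len(ia)):
        if ia[i]:
            trace.append(ib[i])
    return trace


def counting_slice(ia):
    """Compute only the size of that trace, e.g. to preallocate it."""
    count = 0
    for i in range(len(ia)):
        if ia[i]:
            count += 1
    return count


collecting = TransplantDescriptor(
    value_number=1, slice_nodes=["L0", "S0", "S1"], target="before L0",
    index_array="index$y", depends_on={"i"}, count_value_number=2)
counting = TransplantDescriptor(
    value_number=2, slice_nodes=["L0", "S0"], target="before L0",
    index_array="index$cntr", depends_on={"i"})

trace = collecting_slice([1, 0, 1, 1], [5, 6, 7, 5])
size = counting_slice([1, 0, 1, 1])
```

Both slices are placed at the same target here, so the count agrees with the length of the trace that is later collected.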

If t is a collecting transplant descriptor, then v_t will be the value number
of an AST subscript a_t of a nonlocal array reference arr(a_t) in P, and i_t will
store the sequence of all subscripts accessed during execution of P. Note that
the length of this sequence depends on the location where the trace will be
placed in the program, which is given by t_t. For example, if t_t is the statement
for the reference itself, then i_t contains only a single subscript. If t_t is
outside the loop enclosing the array reference, then i_t contains the subscripts
for all iterations of the loop.

If we compute counting transplant descriptors, then n_t will be the value
number of the counter indexing i_t after execution of c_t is finished; in other
words, the value of n_t will be the size of the subscript trace computed for i_t.

If t is a counting transplant descriptor, then there exists a collecting
transplant descriptor d for which v_t = n_d and t_t = t_d; i_t stores the size of
the subscript trace computed for i_d. (t_t must equal t_d because otherwise we
might count too many subscripts, if t_t precedes t_d, or too few, if t_t
succeeds t_d.) Since i_t corresponds to a single value, n_t will be the value
number corresponding to the constant 1.

The d_t stored in each transplant descriptor is a set of AST node indices
for subscripts of references that need runtime processing. Only the references
in c_t that require runtime processing are considered when d_t is created.

A Slice Graph is a directed acyclic graph

      G = (T, E)

that consists of a set of transplant descriptors T and a set of edges E. For
t, d ∈ T, an edge e = (t, d) ∈ E establishes an ordering between t and d. The
edge e in the graph implies that c_d contains a direct or indirect reference
to i_t, and therefore must be executed after c_t. G must be acyclic to be a
valid slice graph. Note that the edges in the slice graph not only indicate a
valid ordering of the transplant descriptors, but they also provide information
for other optimizations. For example, it might be profitable to perform loop
fusion across the code present in the transplant descriptors; the existence of
an edge between nodes, however, indicates that the code in the corresponding
transplant descriptors cannot be fused.
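
A slice graph and a valid execution order for its slices can be sketched with the Python standard library (graphlib, available from Python 3.9). The descriptor names and the helper order_slices are hypothetical; an edge (t, d) is read as "the slice of d runs after the slice of t".

```python
from graphlib import CycleError, TopologicalSorter


def order_slices(descriptors, edges):
    """Return the descriptors in an order that respects every edge."""
    predecessors = {t: set() for t in descriptors}
    for before, after in edges:
        predecessors[after].add(before)
    # static_order() raises CycleError if the graph is not acyclic.
    return list(TopologicalSorter(predecessors).static_order())


# Chain for the example loop: the ia slice feeds ib, which feeds y.
edges = {("t_ia", "t_ib"), ("t_ib", "t_y")}
order = order_slices(["t_y", "t_ib", "t_ia"], edges)

try:
    order_slices(["a", "b"], {("a", "b"), ("b", "a")})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

A cyclic edge set is rejected, which matches the requirement that a valid slice graph be acyclic.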

A Subscript Descriptor

      s = (v_s, t_s)

for the subscript a_s of some distributed array reference consists of the value
number of a_s, v_s, and the location in P where a transplant descriptor
generated for a_s should be placed, t_s. The transformation algorithm will
generate a transplant descriptor for each unique subscript descriptor
corresponding to a distributed array reference requiring runtime preprocessing.
Identifying transplant descriptors via subscript descriptors is efficient, in
that it allows the trace generated for a transplant descriptor to be reused for
several references, possibly of different data arrays, if the subscripts have the
same value number. This optimization is conservative in that it accounts for
situations where different references might have the same subscript value
number, but different constraints as far as prefetch aggregation goes,
corresponding to different target locations in the program.

3.3 Transformation Algorithm 

This section describes the algorithm used to carry out the transformations 
described in Section 3.1. The algorithm consists of two parts. The first part, 
described in Section 3.3.1, analyzes the original program and generates trans- 
plant descriptors and the slice graph. The second part, described in Sec- 
tion 3.3.2, uses these data structures to produce the transformed program. 

3.3.1 Slice Graph Construction. Building the transplant descriptors re- 
quires a list of all non-local distributed array references that require runtime 
preprocessing. We assume for ease of presentation that all references to dis- 
tributed arrays qualify; in practice some reference patterns permit simpler 
handling. 

The procedure Generate_slice_graph() shown in Figure 3.4 is called
with the program P and the set of subscripts R for the non-local array
references. The procedure first generates all the necessary transplant
descriptors and then finds the dependences between them. It returns a slice
graph consisting of a set of transplant descriptors T and edges E.

Procedure Generate_slice_graph(P, R)
      // P: Program to be transformed
      // R: AST indices for subscripts of non-local references
A1    T := ∅                    // Transplant descriptors
A2    E := ∅                    // Transplant descriptor ordering edges
A3    U := ∅                    // Subscript descriptors

      // Compute subscript descriptors.
A4    Foreach a_s ∈ R
A5      v_s := Lookup_val_number(a_s)
A6      L := Preview_slice_inputs(a_s)
A7      t_s := Gen_target(a_s)
A8      U := U ∪ {(v_s, t_s)}
A9    Endforeach

      // Compute transplant descriptors.
A10   Foreach s ∈ U
A11     t := Gen_t_descriptor(s)
A12     T := T ∪ {t}
        // The following steps are executed
        // iff counting transplant descriptors are required.
01      d := Lookup_t_descriptor(T, (n_t, t_t))
02      If d = ∅ Then
03        d := Gen_t_descriptor(n_t, t_t)
04        T := T ∪ {d}
05        E := E ∪ {(d, t)}
06      Endif
A13     Endif
A14   Endforeach

      // Compute edges resulting from
      // dependence sets of transplant descriptors.
A15   Foreach t ∈ T
A16     Foreach a_s ∈ d_t
A17       v_s := Lookup_val_number(a_s)
A18       t_s := Lookup_target(a_s)
A19       d := Lookup_t_descriptor(T, (v_s, t_s))
A20       E := E ∪ {(d, t)}
A21     Endforeach
A22   Endforeach
A23   Return (T, E)

Fig. 3.4. Slice graph generation algorithm.

The Foreach statement labeled A4...A9 computes a subscript descriptor
s = (v_s, t_s) for every requested AST index a_s. The routine
Lookup_val_number(a_s) computes a value number on demand [15];
subscript descriptors r and s, with v_r = v_s, represent values that can be
computed with the same slice. Preview_slice_inputs() uses a simplified version
of the slice construction algorithm we will describe shortly, to build a list L
of the array variables that will be read by the slice. Gen_target() selects a
target location where the slice for v_s can safely be placed; the target t_s must
satisfy several constraints:

1. t_s must be guaranteed to execute before a_s; more precisely, t_s must
predominate a_s in the control-flow graph [22],

2. no array variable in the list L of slice inputs may be modified between t_s
and a_s, and

3. t_s should execute as infrequently as possible (it should be chosen with as
few surrounding loops as possible).

These constraints can be satisfied by first placing t_s immediately before the
statement containing a_s, then moving t_s outside of each enclosing loop until
either there are no loops enclosing t_s or the loop immediately enclosing t_s
contains a statement modifying an array in L.
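
The hoisting rule just described can be sketched as a short loop over the enclosing loop nest. This is an illustration under assumptions: each enclosing loop is summarized by a label and the set of arrays its body modifies, the function name gen_target and the string form of the result are hypothetical, and the dominance requirement of the first constraint is not modeled.

```python
def gen_target(enclosing_loops, slice_inputs):
    """Pick a target for a slice.

    enclosing_loops lists (label, arrays modified in the loop body),
    innermost loop first; slice_inputs is the list L as a set.
    """
    target = "before statement"
    for label, modified in enclosing_loops:
        if modified & slice_inputs:
            break                      # this loop changes a slice input
        target = f"before loop {label}"
    return target


# Loop nest shaped like Figure 2.1: inner loop L2 writes x, and the body
# of the outer loop L1 also modifies ic.
nest = [("L2", {"x"}), ("L1", {"x", "ic"})]
target_ia = gen_target(nest, {"ia"})
target_ic = gen_target(nest, {"ic"})
target_x = gen_target(nest, {"x"})
```

A slice that reads only ia leaves both loops, while one that reads ic can leave only the inner loop and must be re-executed in every outer iteration.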

The next Foreach statement, labeled A10...A14, builds a transplant
descriptor t and, if necessary, a counting transplant descriptor d for every
subscript descriptor s ∈ U. Gen_t_descriptor() takes a subscript descriptor
s = (v_s, t_s) and builds t = (v_t, c_t, t_t, i_t, d_t) with

— v_t = v_s, t_t = t_s,

— c_t the code for the slice to compute the sequence of subscript values,

— i_t an identifier for a new processor-local array variable that will store the
values computed by c_t, and

— d_t a set of subscript AST indices for other array references that are input
values for c_t (i.e., the subscripts for the variables in L).

The slice c_t is built by following backwards along data-flow graph and
control dependence edges to find the control-flow graph nodes (statements) that
contribute to the subscript value. When a distributed array reference is
encountered on a path, building the slice along that path stops and the array
reference is added to c_t and the corresponding subscript is added to d_t.

If we are interested in the size of the subscript trace recorded in t (e.g.,
for allocating trace arrays), then the statements labeled 01...06 compute
a counting transplant descriptor d for each such t. However, different
collecting transplant descriptors can share a counting transplant descriptor if
they have the same counter value number n_t and target location t_t.
Therefore, we must first examine the set of already created transplant
descriptors. Lookup_t_descriptor() takes as input a set of transplant
descriptors T and a subscript descriptor s, and returns the transplant
descriptor d ∈ T corresponding to s if there exists such a d; otherwise, it
returns ∅. If a counting transplant descriptor has not yet been created, a new
counting transplant descriptor d is generated. Since the code in the counting
transplant descriptor d must be executed before the code in the collecting
transplant descriptor t, a directed edge (d, t) must be added to the edge set E.
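
The backward walk that builds a slice can be sketched on a toy dependence graph. This is an illustration under assumptions: the graph is a dictionary from each node to the nodes it depends on (data-flow and control dependences merged), distributed array references are given as a map from the reference node to its subscript, and the node names, loosely modeled on Figure 3.2, are invented for the example.

```python
def build_slice(start, depends_on, distributed_refs):
    """Collect slice nodes and the dependence set for one subscript."""
    slice_nodes, dependence_set = [], set()
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        slice_nodes.append(node)
        if node in distributed_refs:
            # Stop here: the reference joins the slice and its subscript
            # must be preprocessed by another transplant descriptor.
            dependence_set.add(distributed_refs[node])
            continue
        stack.extend(depends_on.get(node, []))
    return slice_nodes, dependence_set


depends_on = {
    "subscript of y": ["ref ib(i)", "if ia(i)"],   # data and control deps
    "if ia(i)": ["ref ia(i)", "loop i"],
    "ref ib(i)": ["loop i"],
    "ref ia(i)": ["loop i"],
}
distributed_refs = {"ref ib(i)": "i in ib(i)", "ref ia(i)": "i in ia(i)"}
nodes, deps = build_slice("subscript of y", depends_on, distributed_refs)
```

The walk does not continue past the two distributed references; their subscripts end up in the dependence set and later give rise to edges in the slice graph.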
The nested Foreach statements labeled A15...A22 are used to find the
directed edges resulting from the dependence sets in each transplant
descriptor. The outer Foreach iterates through the transplant descriptors T and
the inner Foreach iterates through the references a_s stored in the dependence
set d_t of t. All the relevant information has already been generated, so these
loops must only consult previously built data structures to construct the
complete set of edges.

3.3.2 Code Generation. The code generation algorithm is shown in Fig- 
ure 3.5. The procedure Gen_code() takes as input the original program P 
and the slice graph consisting of transplant descriptors T and their order- 
ing E. Gen_code() traverses the program and modifies the subscripts of all 
distributed array references that require runtime preprocessing, and gener- 
ates the required communication. 



Procedure Gen_code(P, T, E)
C1    Topological_sort(T, E)
C2    Foreach t ∈ T
C3        Instantiate_slice(t, T)
C4        Transplant_slice(P, c_t, t_t)
      Endforeach
C5    Instantiate_program(P, T)
C6    Foreach loop ∈ P
C7        Remove_Redundant_Comp(loop)
C7        Generate_Comm(loop)
      Endforeach
C8    Return P



Fig. 3.5. Code generation algorithm. 



The slice graph construction algorithm was mainly concerned with where 
to precompute which subscript traces, and in what order. Before generating 
code, however, we must decide what data structures to use for first recording 
the traces that prefetch nonlocal data, and then accessing the prefetched 
data. The example presented in Section 3.1 used temporary trace arrays 
for performing both of these operations. If there are repeated references to 
the same array elements, it may be profitable to compress the trace used in 
prefetching so that only one copy of each element is communicated. When we 
generate the statements c_t for a transplant descriptor t, we therefore postpone
inserting the code for manipulating these data structures; we do not include
initialization and incrementing of counters, nor assignments of values into
trace arrays. Instead, we mark placeholders for those operations and delay
generation of that code until the slice instantiation phase during final code 
generation. 

Topological_sort() performs a topological sort of the nodes in the slice
graph so that the partial order given by the edges in E is maintained during
code generation for the transplant descriptors in T. The Foreach statement
labeled C2 iterates through the transplant descriptors T in topological
order. Instantiate_slice() takes a transplant descriptor t as input, replaces
subscript references for which preprocessing has already been generated, and
also generates code for collecting the subscript that t is slicing on. After
the slice for t has been instantiated, Transplant_slice() inserts the code
for the slice, c_t, into the program at the target location t_t. The
transformed program is returned to the calling procedure. The function
Instantiate_program() is similar to the function Instantiate_slice(). It
takes as input the program P and the set of transplant descriptors T, and
replaces preprocessed subscripts in P with accesses into the data structures
defined in the preprocessing phase. The program instantiation depends on what
type of data structure was used to store the trace of subscripts in the
collecting transplant descriptors. The Foreach statement labeled C6 iterates
through all the loops in the program and performs two operations. First, the
procedure Remove_Redundant_Comp() performs common subexpression elimination,
so that redundant computation is not performed. Second, the procedure
Generate_Comm() takes as input the code for a loop (with at most a single
level of indirection in any distributed array reference) and inserts all the calls
to the runtime support library to perform the required communication and
global to local translation. 
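
The ordering step of Gen_code() can be illustrated with a small sketch. The
Python below is not the authors' implementation: the descriptor names are
hypothetical strings, and the standard-library graphlib module stands in for
Topological_sort(). It only shows that, for every edge (d, t) in E, the code
for d is scheduled before the code for t.

```python
from graphlib import TopologicalSorter


def order_descriptors(T, E):
    """Return the descriptors in T in an order consistent with the edges in E.

    T: iterable of descriptor names (hypothetical identifiers).
    E: iterable of (d, t) pairs meaning "code for d must run before code for t".
    """
    sorter = TopologicalSorter()
    for t in T:
        sorter.add(t)          # register nodes, including those without edges
    for d, t in E:
        sorter.add(t, d)       # d is a predecessor of t
    # static_order() raises graphlib.CycleError if E contains a cycle.
    return list(sorter.static_order())


# Hypothetical slice graph: one counting descriptor feeding a collecting
# descriptor, whose trace is in turn needed by a second collecting descriptor.
T = ["collect_ia", "count_ia", "collect_x"]
E = [("count_ia", "collect_ia"), ("collect_ia", "collect_x")]
order = order_descriptors(T, E)
```

Slices would then be instantiated and transplanted by walking `order` from
front to back, which mirrors the Foreach statement labeled C2.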



4. Experiments 

In this section we present performance results from applications parallelized 
by hand and by the compiler using our transformations. The main parallelized
application is a molecular dynamics code called CHARMM (Chemistry at
HARvard Macromolecular Mechanics). The application program had
previously been ported to distributed memory parallel machines using the 
CHAOS runtime library routines. 

4.1 Hand Parallelization with CHAOS 
Overview 

CHARMM is a program that calculates empirical energy functions to model 
macromolecular systems. The purpose of CHARMM is to derive structural 
and dynamic properties of molecules using first and second order derivative
techniques [4].

The computationally intensive part of CHARMM is the molecular dynamics
simulation. The computation simulates the dynamic interactions among all
atoms in the system for a period of time. For each time step, the simulation
calculates the forces between atoms, the energy of the entire system, and the
motion of atoms by integrating Newton's equations of motion. The simulation
then updates the spatial positions of the atoms. The positions of the atoms
are fixed during the energy calculation; however, they are updated when
spatial displacements due to atomic forces are calculated.

The loop structure of the molecular dynamics simulation is shown in 
Figure 4.1. The physical values associated with atoms, such as velocity, force 
and displacement, are accessed using indirection arrays (IB, JB, etc.). The 
energy calculations in the molecular dynamics simulation consist of two types 
of interactions - bonded and non-bonded. 

Bonded forces exist between atoms connected by chemical bonds. 
CHARMM calculates four types of bonded forces - bond potential, bond an- 
gle potential, dihedral angle (torsion) potential, and improper torsion. These 
forces are short-range; they exist between atoms that lie close to each
other in space. Bonded interactions remain unchanged during the entire sim- 
ulation process because the chemical bonds in the system do not change. 
The complexity of bonded forces calculations is approximately linear in the 
number of atoms, because each atom has a finite number of bonds with other 
atoms. 

Non-bonded forces are the van der Waals interactions and electrostatic po- 
tential between all pairs of atoms. The time complexity of non-bonded forces 
calculations is O(N^2), because each atom interacts with all other atoms in
the system. In simulating large molecular structures, CHARMM approxi- 
mates the non-bonded force calculation by ignoring all interactions beyond a 
certain cutoff radius. This approximation is done by generating a non-bonded 
list, stored in array JNB, that contains all pairs of interactions within the 
cutoff radius. The spatial positions of the atoms change after a time step, so 
the non-bonded list must also be regenerated. However, in CHARMM, the 
user has control over non-bonded list regeneration frequency. If the atoms 
do not move far in each time step, not generating the non-bonded list every 
time step will not significantly affect the results of the simulation. 
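
A cutoff-based pair list of this kind can be sketched as follows. This is an
illustration only, not CHARMM code: it assumes 0-based atom numbers, stores
each pair once (partner j > i), and scans all pairs, which real codes avoid
for large systems. The names `inblo` and `jnb` mirror the INBLO and JNB arrays
of Figure 4.1.

```python
import math


def build_nonbonded_list(coords, cutoff):
    """Build a pair list in an INBLO/JNB-style layout.

    coords: list of (x, y, z) tuples, one per atom.
    Returns (inblo, jnb); the partners of atom i are jnb[inblo[i]:inblo[i + 1]].
    """
    inblo = [0]
    jnb = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            # Keep only interactions inside the cutoff radius.
            if math.dist(coords[i], coords[j]) <= cutoff:
                jnb.append(j)
        inblo.append(len(jnb))   # end offset of atom i's partner range
    return inblo, jnb


# Four atoms on a line; only (0, 1) and (2, 3) are within the cutoff.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
inblo, jnb = build_nonbonded_list(coords, cutoff=2.0)
```

Because `jnb` depends on the current coordinates, it goes stale as atoms
move, which is why the list has to be regenerated from time to time.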

Parallelization Approach 

Data Partitioning Spatial information is associated with each atom. 
Bonded interactions occur between atoms in close proximity to each other. 
Non-bonded interactions are excluded beyond a cutoff radius. Additionally, 
the amount of computation associated with an atom depends on the num- 
ber of atoms with which it interacts - the number of JNB (non-bonded list) 
entries for that atom. The way in which the atoms are numbered frequently 
does not have a useful correspondence to the interaction patterns within the 
molecule. A naive data distribution across processors of the arrays storing 
information for the atoms, such as BLOCK or CYCLIC, may result in a high 

L1: DO N = 1, nsteps
        Regenerate non-bonded list if required

C       Bonded Force Calculations
L2:     DO I = 1, NBONDS
            Calculate force between atoms IB(I) and JB(I)
        END DO
L3:     DO I = 1, NANGLES
            Calculate angle potential of atoms IT(I), JT(I), and KT(I)
        END DO

C       Non-Bonded Force Calculation
L4:     DO I = 1, NATOMS
            DO J = INBLO(I)+1, INBLO(I+1)
                Calculate force between atoms I and JNB(J)
            END DO
        END DO

        Integrate Newton's Equations and Update Atom Coordinates
    END DO

Fig. 4.1. Molecular Dynamics Simulation Code from CHARMM 



volume of communication and poor load balance. Hence, data partitioners 
such as recursive coordinate bisection (RCB) [3] and recursive inertial bisec- 
tion (RIB) [25], which use spatial information as well as computational load, 
are good candidates to effectively partition atoms across the processors. All 
data arrays that are associated with the atoms should be distributed in an 
identical manner. 
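
The idea behind recursive coordinate bisection can be sketched in a few lines.
The function below is a simplified illustration, not the partitioner used in
the experiments: it assumes the number of parts is a power of two, always cuts
the coordinate axis with the largest extent, and places the cut where the
accumulated per-atom weight reaches half of the total.

```python
def rcb(points, weights, ids, nparts):
    """Recursive coordinate bisection (sketch).

    points:  dict atom id -> coordinate tuple.
    weights: dict atom id -> computational load of that atom.
    ids:     list of atom ids to partition.
    nparts:  number of parts, assumed to be a power of two.
    Returns a list of nparts lists of atom ids.
    """
    if not ids:
        return [[] for _ in range(nparts)]
    if nparts == 1:
        return [list(ids)]
    dims = len(points[ids[0]])
    # Choose the axis along which the atoms are spread out the most.
    axis = max(
        range(dims),
        key=lambda a: max(points[i][a] for i in ids) - min(points[i][a] for i in ids),
    )
    ordered = sorted(ids, key=lambda i: points[i][axis])
    # Advance the cut while the left half stays within half of the total load.
    half = sum(weights[i] for i in ordered) / 2.0
    acc, cut = 0.0, 0
    while cut < len(ordered) - 1 and acc + weights[ordered[cut]] <= half:
        acc += weights[ordered[cut]]
        cut += 1
    cut = max(cut, 1)   # never leave the left half empty
    return (rcb(points, weights, ordered[:cut], nparts // 2)
            + rcb(points, weights, ordered[cut:], nparts // 2))


# Eight atoms on a line with equal load, split over four processors.
points = {i: (float(i), 0.0, 0.0) for i in range(8)}
parts_uniform = rcb(points, {i: 1.0 for i in range(8)}, list(range(8)), 4)

# One heavy atom: the cut moves so that both halves carry the same load.
heavy = {0: 3.0, 1: 1.0, 2: 1.0, 3: 1.0}
parts_weighted = rcb(points, heavy, [0, 1, 2, 3], 2)
```

In the second call atom 0 ends up alone in its part, which shows how the
weights let the partitioner account for computational load as well as
spatial position.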

Iteration Partitioning Once the atoms are partitioned, the data distri- 
bution can be used to determine how loop iterations are partitioned among 
the processors. Each iteration of the bonded force calculations is assigned to 
the processor that has the maximum number of local array elements. If the 
choice of processor is not unique (two processors own the same number of 
atoms required for the loop iteration), the processor with the lowest com- 
putational load is chosen. Bonded force calculations consume only about 1% 
of the total execution time for the complete energy calculation. Non-bonded 
force calculations consume 90% of the execution time. Hence, balancing the 
computational load due to non-bonded calculations is of primary concern. To 
balance the load, the non-bonded force calculation for an atom is assigned 
to the processor that owns the data for the atom, since the atoms are dis- 
tributed using both geometrical and computational load information. Hence, 
each iteration of the outer loop, labeled L4 in Figure 4.1, is assigned to the 
processor that owns the atom. 
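
The assignment rule for the bonded loops can be sketched as below. This is an
illustration under stated assumptions rather than the CHAOS iteration
partitioner: `owner` is a hypothetical map from atom to processor, and the
computational load of a processor is approximated by the number of iterations
already assigned to it.

```python
from collections import Counter


def assign_iterations(iteration_refs, owner, nprocs):
    """Assign each loop iteration to a processor.

    iteration_refs: entry k lists the atoms referenced by iteration k,
                    e.g. (IB(k), JB(k)) for a bond.
    owner:          dict atom id -> owning processor.
    Rule: choose the processor that owns most of the referenced atoms;
    break ties in favor of the processor with the lowest current load.
    """
    load = [0] * nprocs
    assignment = []
    for refs in iteration_refs:
        counts = Counter(owner[a] for a in refs)
        best = max(counts.values())
        candidates = [p for p, c in counts.items() if c == best]
        proc = min(candidates, key=lambda p: (load[p], p))
        assignment.append(proc)
        load[proc] += 1      # load measured in iterations (an assumption)
    return assignment


# Atoms 0 and 1 live on processor 0, atoms 2 and 3 on processor 1.
owner = {0: 0, 1: 0, 2: 1, 3: 1}
bonds = [(0, 1), (1, 2), (2, 3), (0, 3)]
assignment = assign_iterations(bonds, owner, nprocs=2)
```

The second and fourth bonds straddle both processors, so they are placed by
the tie-breaking rule.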

Remapping and Loop Preprocessing Once the distributions of the data
and loop iterations are determined, CHAOS library routines can be used to
remap the data and indirection arrays from one distribution to another
(to implement the load balancing algorithm). After remapping,
loop preprocessing is carried out to translate global to local references and
to generate communication schedules for exchanging data among processors. 
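
What loop preprocessing produces can be sketched as follows. The function is
hypothetical and does not reproduce any CHAOS routine; it assumes that
ownership and local positions are available as plain lists, and that
off-processor copies are stored in ghost slots placed after the local
elements.

```python
def inspect(indirection, owner, local_index, me):
    """Inspector sketch for processor `me`.

    indirection:    global indices referenced by this processor's iterations.
    owner[g]:       processor that owns global element g.
    local_index[g]: position of g in its owner's local array.
    Returns (local_refs, schedule, n_ghost):
      local_refs - the indirection array rewritten to local or ghost indices
      schedule   - dict owner -> local indices to fetch from that owner
      n_ghost    - number of ghost slots needed after the local elements
    """
    n_local = sum(1 for p in owner if p == me)
    ghost_slot = {}
    schedule = {}
    local_refs = []
    for g in indirection:
        if owner[g] == me:
            local_refs.append(local_index[g])
        else:
            if g not in ghost_slot:          # fetch each element only once
                ghost_slot[g] = n_local + len(ghost_slot)
                schedule.setdefault(owner[g], []).append(local_index[g])
            local_refs.append(ghost_slot[g])
    return local_refs, schedule, len(ghost_slot)


# Five elements: processor 0 owns 0-1, processor 1 owns 2-4.
owner = [0, 0, 1, 1, 1]
local_index = [0, 1, 0, 1, 2]
local_refs, schedule, n_ghost = inspect([1, 3, 4, 3, 0], owner, local_index, me=0)
```

The loop body can then index a local array extended by `n_ghost` slots using
`local_refs`, once the data named in `schedule` has been gathered.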

The indirection arrays used in the bonded force calculation loops remain 
unchanged, but the non-bonded list changes during computation. Therefore 
the preprocessing for the bonded force calculation loops need not be repeated, 
whereas it must be repeated for the non-bonded force calculation loops when- 
ever the non-bonded list changes. In this case, the CHAOS hash table and 
stamps are useful for loop preprocessing. When schedules are built, indirec- 
tion arrays are hashed with unique time stamps. The hash table is used to 
remove any duplicate off-processor references. When the non-bonded list is 
regenerated, non-bonded list entries in the hash table are cleared using the 
corresponding stamp. Then the same stamp can be reused and the new non- 
bonded list entries are hashed with the reused stamp. 
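
The stamp mechanism can be pictured with a toy table. The class below is a
sketch only; its names and methods are invented for illustration and are not
the CHAOS interface. Each hashed reference remembers which stamps refer to it,
so the entries of one indirection array can be cleared and rehashed without
disturbing the others.

```python
class StampedTable:
    """Toy hash table of references tagged with stamps (illustration only)."""

    def __init__(self):
        self._stamps = {}    # global index -> set of stamps that reference it

    def hash_array(self, indices, stamp):
        """Enter every index of an indirection array under `stamp`."""
        for g in indices:
            self._stamps.setdefault(g, set()).add(stamp)

    def clear_stamp(self, stamp):
        """Drop `stamp` and remove entries that no other stamp references."""
        for g in list(self._stamps):
            self._stamps[g].discard(stamp)
            if not self._stamps[g]:
                del self._stamps[g]

    def entries(self, stamps=None):
        """Sorted unique indices carrying any of `stamps` (all if None)."""
        return sorted(g for g, s in self._stamps.items()
                      if stamps is None or s & set(stamps))


BOND, NONBOND = 1, 2             # hypothetical stamp values
table = StampedTable()
table.hash_array([7, 9, 7], BOND)          # duplicates collapse to one entry
table.hash_array([9, 12, 15], NONBOND)
before = table.entries()

# Non-bonded list regenerated: clear its stamp, then reuse it for the new list.
table.clear_stamp(NONBOND)
table.hash_array([9, 20], NONBOND)
after = table.entries()
after_nonbond = table.entries([NONBOND])
```

The bonded entries survive the regeneration untouched, which is why the
preprocessing for the bonded loops does not have to be repeated.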

Performance 

The performance of the molecular dynamics simulations was studied with 
a benchmark input data set (MbCO + 3830 water molecules) on the Intel 
iPSC/860. The simulation ran for 1000 time steps, performing 40 non-bonded 
list regenerations. The cutoff radius for the non-bonded list generation was 14
Å. The performance results are presented in Table 4.1. The RCB partitioner
was used to partition the atoms. The execution time includes the energy
calculation time and the communication time for each processor. The
computation time shown is the average computation time over all
processors, and the communication time shown is the average communication
time. The load balance index was calculated as

\[
\frac{\left(\max_{i=1}^{n} \text{computation time of processor } i\right) \times (\text{number of processors } n)}
     {\sum_{i=1}^{n} \text{computation time of processor } i}
\]

The results show that CHARMM scales well and that good load balance was 
maintained up to 128 processors. 
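
As a small worked example, the load balance index is the maximum
per-processor computation time multiplied by the number of processors and
divided by the sum of the per-processor computation times. The per-processor
times used below are made up for illustration; they are not measurements from
the experiments.

```python
def load_balance_index(comp_times):
    """Return max(comp_times) * n / sum(comp_times); 1.0 is perfect balance."""
    n = len(comp_times)
    return max(comp_times) * n / sum(comp_times)


# Hypothetical per-processor computation times in seconds.
perfect = load_balance_index([10.0, 10.0, 10.0, 10.0])
skewed = load_balance_index([12.0, 10.0, 9.0, 9.0])   # 12 * 4 / 40
```

An index of 1.0 means every processor computes for the same time; larger
values mean that the slowest processor holds the others back.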



Table 4.1. Performance of Parallel CHARMM on Intel iPSC/860 (in sec.)

Number of Processors        1*        16        32        64       128
Execution Time         74595.5    4356.0    2293.8    1261.4     781.8
Computation Time       74595.5    4099.4    2026.8    1011.2     507.6
Communication Time         0.0     147.1     159.8     181.1     219.2
Load Balance Index        1.00      1.03      1.05      1.06      1.08

* Estimated time computed by Brooks and Hodoscek [5]

Preprocessing Overheads Data and iteration partitioning, remapping,
and loop preprocessing must be done at runtime for the parallel simulation.
The preprocessing overheads incurred are shown in Table 4.2. The data
partition time is the execution time of the RCB partitioner. After partitioning
the atoms, the non-bonded list is regenerated. This initial non-bonded list
regeneration is performed because at the beginning of the simulation the
initial distribution of atoms is by blocks across processors, and the atoms must
be redistributed according to the results of the partitioner. In Table 4.2, the
regeneration time is shown as the non-bonded list update time.

In the course of the simulation, the non-bonded list is regenerated peri- 
odically. When the non-bonded list is updated, the corresponding commu- 
nication schedule must be regenerated. The schedule regeneration time in 
Table 4.2 shows the total schedule regeneration time for 40 non-bonded list 
updates. By comparing these numbers to those in Table 4.1, we see that the 
preprocessing overhead is small compared to the total execution time. 
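
The comparison between Tables 4.1 and 4.2 can be made concrete with a little
arithmetic. The sketch below only reuses the published figures. Adding the
five rows of Table 4.2 into a single overhead figure is an assumption of this
sketch, and the speedups are relative to the 1-processor time in Table 4.1.

```python
procs = [16, 32, 64, 128]

# Execution times from Table 4.1, in seconds.
exec_time = {1: 74595.5, 16: 4356.0, 32: 2293.8, 64: 1261.4, 128: 781.8}

# Rows of Table 4.2, in seconds, in the column order given by `procs`.
overheads = {
    "data partition":              [0.27, 0.47, 0.83, 1.63],
    "non-bonded list update":      [7.18, 3.85, 2.16, 1.22],
    "remapping and preprocessing": [0.03, 0.03, 0.02, 0.02],
    "schedule generation":         [1.31, 0.80, 0.64, 0.42],
    "schedule regeneration":       [43.51, 23.36, 13.18, 8.92],
}

speedup = {p: exec_time[1] / exec_time[p] for p in procs}
total_overhead = {p: sum(row[k] for row in overheads.values())
                  for k, p in enumerate(procs)}
overhead_fraction = {p: total_overhead[p] / exec_time[p] for p in procs}

for p in procs:
    print(f"{p:4d} processors: speedup {speedup[p]:6.1f}, "
          f"overhead {total_overhead[p]:6.2f} s "
          f"({100 * overhead_fraction[p]:.2f}% of execution time)")
```

Under these assumptions the summed overhead is about 1.2% of the execution
time on 16 processors and about 1.6% on 128 processors.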



Table 4.2. Preprocessing Overheads of CHARMM for 1000 iterations (in sec.)

Number of Processors              16       32       64      128
Data Partition                  0.27     0.47     0.83     1.63
Non-bonded List Update          7.18     3.85     2.16     1.22
Remapping and Preprocessing     0.03     0.03     0.02     0.02
Schedule Generation             1.31     0.80     0.64     0.42
Schedule Regeneration          43.51    23.36    13.18     8.92



4.2 Compiler Parallelization Using CHAOS 

We have implemented our loop transformation algorithm as part of the Rice 
Fortran D compiler [18]. We have successfully parallelized a number of kernels 
derived from various irregular applications. These applications are structured 
such that they cannot be parallelized using previous compilation techniques 
without a severe degradation in performance. The only other known auto- 
matic method that can be used to parallelize these kernels is runtime resolu- 
tion [27], but that technique generates parallel code with poor performance, 
since each off-processor reference requires a separate communication
operation.

In this section we present performance results from two kernels that were 
parallelized using our techniques. Tables 4.3, 4.4 and 4.5 show performance 
measurements for compiler-generated code as well as for programs paral- 
lelized by manually inserting CHAOS library calls. We have invested sub- 
stantial effort in hand optimizing the full parallel application codes from 
which these kernels were extracted [8,9]. We are therefore confident that
the hand parallelized kernels are well-optimized. These performance results 
are presented to demonstrate that, in the cases we have studied, the per- 
formance obtained by using the compiler transformations described in this 
paper is close to the performance of a good hand parallelized version. For 
both kernels the hand parallelized code performed better than the compiler 
generated code for two reasons: decreased interprocessor communication vol- 
ume and fewer messages exchanged between processors. 



Table 4.3. Timings from CHARMM Kernel (100 iterations) - iPSC/860 (in
sec., 648 atoms)

                                  Code Generation
              Block Partitioning              Recursive Coordinate Bisection
Processors    Hand (optimized)   Compiler     Hand (optimized)   Compiler
     2              20.5           26.3             16.9           21.2
     4              16.8           20.1             12.6           15.8
     8              13.1           17.6             10.1           11.1
    16              11.3           15.8              8.7            9.6
    32              12.5           15.8              9.2           10.7



Table 4.4. Timings from CHARMM Kernel (100 iterations) - iPSC/860 (in
sec., 14023 atoms)

                       Code Generation
              Recursive Coordinate Bisection
Processors    Hand (optimized)   Compiler
    16             488.1           521.8
    32             308.4           338.0
    64             202.8           225.2
   128             108.3           133.7



One of the kernels was extracted from CHARMM [4]. We compare the 
compiler parallelized versions against the optimized hand parallelized code 
in Tables 4.3 and 4.4. In the hand parallelized code both the loop iterations 
and the data were block partitioned. For the compiler parallelized version 
we tried both block and recursive coordinate bisection data partitioning. 
We have used a Fortran D on-clause [12] to override the default iteration 
space partition, instead using the iteration partitioning strategy previously 
described for CHARMM. The input data set for the results shown in Table 4.3 
was small (648 atoms), hence beyond 16 processors there is no reduction in 
computation time from using more processors. A larger input data set was 

Table 4.5. Timings from EUL3D Kernel (100 iterations) - iPSC/860 (in sec.)

                       Code Generation
Processors    Hand (optimized)   Compiler
     2              21.0           27.7
     4              11.5           16.2
     8               6.8           10.3
    16               4.7            7.7
    32               4.8            7.9



used for the results shown in Table 4.4, and the speedups achieved are much 
better. 

The other kernel was extracted from an unstructured grid Euler solver,
EUL3D [23]. For this kernel, the data for both the hand and the compiler
parallelized versions were partitioned using recursive coordinate bisection
(RCB). We used the on-clause directive to partition the iteration space to
get good load balance. The input to the kernel was an unstructured mesh 
(9428 points) from an aerodynamics simulation. 

We also performed experiments on a Cray Y-MP. The kernel extracted 
from the molecular dynamics code was used as the test case, and the small 
input data set was used. To obtain the Cray Y-MP code we performed the
compiler transformations by hand. Using those techniques, the average time 
for a single iteration was reduced from 0.38 seconds to 0.32 seconds. The 
preprocessing time for the transformed code was 0.093 seconds. 



5. Conclusions 

The techniques that we have presented in this paper can be used by a com- 
piler to transform application programs with complex array access patterns 
(meaning multiple levels of indirect access) into programs with arrays only 
indexed by a single level of indirection. We have shown how the techniques 
can be utilized to automatically generate parallel code for static and adap- 
tive irregular problems. We have implemented the transformation algorithm 
presented in Section 3.3 and have obtained encouraging performance results 
on several adaptive irregular codes. 
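
The effect of the transformation can be illustrated on a two-level access
x(ia(ib(i))). The sketch below is sequential Python with hypothetical names
and is not the compiler's output; it only shows how a precomputed subscript
trace turns a doubly indirect reference into a singly indirect one. In the
distributed setting, building the trace would itself require communication
when ia is a distributed array.

```python
def collect_trace(ia, ib):
    """Preprocessing slice: record the subscript ia[ib[i]] for each iteration i."""
    return [ia[ib[i]] for i in range(len(ib))]


def original_loop(x, ia, ib):
    """Loop with two levels of indirection in the reference to x."""
    return sum(x[ia[ib[i]]] for i in range(len(ib)))


def transformed_loop(x, trace):
    """Same loop after the transformation: a single level of indirection."""
    return sum(x[trace[i]] for i in range(len(trace)))


x = [10, 20, 30, 40]
ia = [3, 0, 2, 1]
ib = [1, 1, 2, 0]

trace = collect_trace(ia, ib)
result_original = original_loop(x, ia, ib)
result_transformed = transformed_loop(x, trace)
```

Both loops compute the same value; the transformed loop is in the form that
single-level communication generation can handle.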

We have also discussed new runtime support optimizations, lightweight 
schedules and two-phase schedule generation, that are required to efficiently
parallelize adaptive irregular programs on distributed memory parallel ma- 
chines. While this paper did not focus extensively on these runtime support 
issues, the compiler techniques implicitly assume the existence of such run- 
time support [10,20,28]. 

While we have presented the compiler transformation techniques in the 
context of optimizing communication in a distributed memory parallel
environment, the techniques can also be used by a compiler to generate code
to optimize parallel I/O, and to prefetch data and vectorize operations on 
architectures that have multi-level memory hierarchies. 